Anomaly database system for processing telemetry data

ABSTRACT

In some examples, an anomaly database system is provided for processing metrics in telemetry data. An example anomaly database system comprises a continuous data management (CDM) node, the CDM node including a metrics library for sending out system metrics in a sparse manner and a statistics relay for receiving streaming metrics from nodes in a node cluster, the node cluster including the CDM node, the statistics relay pushing the received metrics to a metrics collector. A sparse consumers module pulls metrics, from the metrics collector, pushed to the metrics collector by the statistics relay.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 17/068,209 by Brar et al., entitled “Anomaly Database System for Processing Telemetry Data” and filed Oct. 12, 2020, which is assigned to the assignee hereof and incorporated by reference herein in its entirety.

FIELD

The present disclosure relates generally to computer architecture software for a data management platform and, in some more particular aspects, to an anomaly database system for processing telemetry data.

BACKGROUND

The volume and complexity of data that is collected, analyzed and stored is increasing rapidly over time. The computer infrastructure used to handle this data is also becoming more complex, with more processing power and more portability. As a result, data management and storage is becoming increasingly important. Significant issues of these processes include access to reliable data backup and storage, and fast data recovery in cases of failure. Other aspects include data portability across locations and platforms.

Telemetry data includes information about a system or a device and how it is configured, including hardware attributes such as central processing unit (CPU) usage, installed memory, and storage, as well as quality-related information such as uptime and sleep details and numbers of crashes or hangs. In incident reporting and remediation, for example, a lack of telemetric data can lengthen Mean Time To Repair (MTTR): the available data may be sufficient for identifying that a problem has occurred but not good enough to identify a root cause of a failure.

BRIEF SUMMARY

In some examples, an anomaly database system is provided for processing metrics in telemetry data. An example anomaly database system comprises a continuous data management (CDM) node, the CDM node including a metrics library for sending out system metrics in a sparse manner; a statistics relay for receiving streaming metrics from nodes in a node cluster, the node cluster including the CDM node, the statistics relay pushing the received metrics to a metrics collector; and a sparse consumers module to pull metrics, from the metrics collector, pushed to the metrics collector by the statistics relay.

In some examples, the sparse consumers module includes at least one processor configured to run a sparse algorithm on the pulled metrics to reduce a number of data points. In some examples, the sparse algorithm is selected from a group of sparse algorithms comprising: a diff-value algorithm, a last-value-delta algorithm, a standard deviation band algorithm, a standard deviation band algorithm with a last-value fallback, and a last-value-delta with percentile algorithm. In some examples, values generated by the sparse algorithm are bounded and assigned a publication status based on falling within a bounded value.
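This section names the sparse algorithms without defining them. As a rough illustration only, the sketch below assumes that a last-value-delta algorithm publishes a data point when it differs from the last published value by more than a threshold. The function name `last_value_delta`, the `threshold` parameter, and the sample data are hypothetical and are not taken from the disclosure.

```python
def last_value_delta(points, threshold):
    """Keep a (timestamp, value) point only when its value differs from the
    last kept value by more than `threshold` (assumed rule, for illustration)."""
    kept = []
    last_value = None
    for timestamp, value in points:
        if last_value is None or abs(value - last_value) > threshold:
            kept.append((timestamp, value))
            last_value = value
    return kept


# Hypothetical CPU-usage samples: (timestamp, value)
samples = [(0, 10.0), (1, 10.1), (2, 10.2), (3, 12.0), (4, 12.1), (5, 9.0)]
sparse = last_value_delta(samples, threshold=1.0)
print(sparse)
```

Under this assumed rule, six samples reduce to three, which is one way the number of data points could be reduced.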

In some examples, the anomaly database system further comprises a rollup module to enable read queries over a designated time range.

In some examples, the anomaly database system further comprises a baseline estimator to pre-compute baselines on the streaming metrics to enable anomaly detection, correlations and multi-series sparseness.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Some embodiments are illustrated by way of example and not limitation in the views of the accompanying drawing:

FIG. 1 depicts one embodiment of a networked computing environment in which the disclosed technology may be practiced, according to an example embodiment.

FIG. 2 depicts one embodiment of the server of FIG. 1, according to an example embodiment.

FIG. 3 depicts one embodiment of the storage appliance of FIG. 1, according to an example embodiment.

FIG. 4 shows an example cluster of a distributed decentralized database, according to some example embodiments.

FIG. 5 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 6 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 7 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 8 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 9 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 10 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 11 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 12 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 13 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 14 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 15 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 16 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 17 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 18 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 19 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 20 illustrates an aspect of the subject matter in accordance with one embodiment.

FIG. 21 depicts a block flow chart indicating example operations in a method, according to example embodiments.

FIG. 22 depicts a block diagram illustrating an example of a software architecture that may be installed on a machine, according to some example embodiments.

FIG. 23 depicts a block diagram illustrating an architecture of software, according to an example embodiment.

FIG. 24 illustrates a diagrammatic representation of a machine 1000 in the form of a computer system within which a set of instructions may be executed for causing a machine to perform any one or more of the methodologies discussed herein, according to an example embodiment.

DETAILED DESCRIPTION

The description that follows includes systems, methods, techniques, instruction sequences, and computing machine program products that embody illustrative embodiments of the present disclosure. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of example embodiments. It will be evident, however, to one skilled in the art that the present inventive subject matter may be practiced without these specific details.

A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever. The following notice applies to the software and data as described below and in the drawings that form a part of this document: Copyright Rubrik, Inc., 2020, All Rights Reserved.

It will be appreciated that some of the examples disclosed herein are described in the context of virtual machines that are backed up by using base and incremental snapshots, for example. This should not necessarily be regarded as limiting of the disclosures. The disclosures, systems and methods described herein apply not only to virtual machines of all types that run a file system (for example), but also to network-attached storage (NAS) devices, physical machines (for example Linux servers), and databases.

FIG. 1 depicts one embodiment of a networked computing environment 100 in which the disclosed technology may be practiced. As depicted, the networked computing environment 100 includes a data center 106, a storage appliance 102, and a computing device 108 in communication with each other via one or more networks 128. The networked computing environment 100 may also include a plurality of computing devices interconnected through one or more networks 128. The one or more networks 128 may allow computing devices and/or storage devices to connect to and communicate with other computing devices and/or other storage devices. In some cases, the networked computing environment 100 may include other computing devices and/or other storage devices not shown. The other computing devices may include, for example, a mobile computing device, a non-mobile computing device, a server, a work-station, a laptop computer, a tablet computer, a desktop computer, or an information processing system. The other storage devices may include, for example, a storage area network storage device, a network-attached storage device, a hard disk drive, a solid-state drive, or a data storage system.

The data center 106 may include one or more servers, such as server 200, in communication with one or more storage devices, such as storage device 104. The one or more servers may also be in communication with one or more storage appliances, such as storage appliance 102. The server 200, storage device 104, and storage appliance 300 may be in communication with each other via a networking fabric connecting servers and data storage units within the data center 106 to each other. The storage appliance 300 may include a data management system for backing up virtual machines and/or files within a virtualized infrastructure. The server 200 may be used to create and manage one or more virtual machines associated with a virtualized infrastructure.

The one or more virtual machines may run various applications, such as a database application or a web server. The storage device 104 may include one or more hardware storage devices for storing data, such as a hard disk drive (HDD), a magnetic tape drive, a solid-state drive (SSD), a storage area network (SAN) storage device, or a NAS device. In some cases, a data center, such as data center 106, may include thousands of servers and/or data storage devices in communication with each other. The one or more data storage devices 104 may comprise a tiered data storage infrastructure (or a portion of a tiered data storage infrastructure). The tiered data storage infrastructure may allow for the movement of data across different tiers of a data storage infrastructure between higher-cost, higher-performance storage devices (e.g., solid-state drives and hard disk drives) and relatively lower-cost, lower-performance storage devices (e.g., magnetic tape drives).

The one or more networks 128 may include a secure network such as an enterprise private network, an unsecure network such as a wireless open network, a local area network (LAN), a wide area network (WAN), and the Internet. The one or more networks 128 may include a cellular network, a mobile network, a wireless network, or a wired network. Each network of the one or more networks 128 may include hubs, bridges, routers, switches, and wired transmission media such as a direct-wired connection. The one or more networks 128 may include an extranet or other private network for securely sharing information or providing controlled access to applications or files.

A server, such as server 200, may allow a client to download information or files (e.g., executable, text, application, audio, image, or video files) from the server 200 or to perform a search query related to particular information stored on the server 200. In some cases, a server may act as an application server or a file server. In general, server 200 may refer to a hardware device that acts as the host in a client-server relationship or a software process that shares a resource with or performs work for one or more clients.

One embodiment of server 200 includes a network interface 110, processor 112, memory 114, disk 116, and virtualization manager 118 all in communication with each other. Network interface 110 allows server 200 to connect to one or more networks 128. Network interface 110 may include a wireless network interface and/or a wired network interface. Processor 112 allows server 200 to execute computer-readable instructions stored in memory 114 in order to perform processes described herein. Processor 112 may include one or more processing units, such as one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs). Memory 114 may comprise one or more types of memory, which may include random access memory (RAM), static RAM (SRAM), dynamic RAM (DRAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), Flash, etc. Disk 116 may include a hard disk drive and/or a solid-state drive. Memory 114 and disk 116 may comprise hardware storage devices.

The virtualization manager 118 may manage a virtualized infrastructure and perform management operations associated with the virtualized infrastructure. The virtualization manager 118 may manage the provisioning of virtual machines running within the virtualized infrastructure and provide an interface to computing devices interacting with the virtualized infrastructure. In one example, the virtualization manager 118 may set a virtual machine having a virtual disk into a frozen state in response to a snapshot request made via an application programming interface (API) by a storage appliance, such as storage appliance 300. Setting the virtual machine into a frozen state may allow a point in time snapshot of the virtual machine to be stored or transferred. In one example, updates made to a virtual machine that has been set into a frozen state may be written to a separate file (e.g., an update file) while the virtual disk may be set into a read-only state to prevent modifications to the virtual disk file while the virtual machine is in the frozen state.

The virtualization manager 118 may then transfer data associated with the virtual machine (e.g., an image of the virtual machine or a portion of the image of the virtual disk file associated with the state of the virtual disk at the point in time it is frozen) to a storage appliance (for example, a storage appliance 102 or storage appliance 300 of FIG. 1, described further below) in response to a request made by the storage appliance. After the data associated with the point in time snapshot of the virtual machine has been transferred to the storage appliance 300 (for example), the virtual machine may be released from the frozen state (i.e., unfrozen) and the updates made to the virtual machine and stored in the separate file may be merged into the virtual disk file. The virtualization manager 118 may perform various virtual machine-related tasks, such as cloning virtual machines, creating new virtual machines, monitoring the state of virtual machines, moving virtual machines between physical hosts for load balancing purposes, and facilitating backups of virtual machines.
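The freeze, snapshot, and merge sequence described above can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the virtualization manager's actual interface: the class `VirtualMachineDisk`, its methods, and the block-indexed dictionaries are all hypothetical names introduced for illustration.

```python
class VirtualMachineDisk:
    """Toy model: a virtual disk file plus a separate update file used while frozen."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)  # virtual disk file: block index -> data
        self.updates = {}           # separate update file (hypothetical layout)
        self.frozen = False

    def freeze(self):
        # While frozen, the virtual disk file is treated as read-only.
        self.frozen = True

    def write(self, index, data):
        # Writes made while frozen are redirected to the update file.
        if self.frozen:
            self.updates[index] = data
        else:
            self.blocks[index] = data

    def snapshot(self):
        # A point-in-time copy is only taken while the disk file is frozen.
        if not self.frozen:
            raise RuntimeError("freeze the virtual machine before taking a snapshot")
        return dict(self.blocks)

    def unfreeze(self):
        # Merge the update file back into the virtual disk file.
        self.blocks.update(self.updates)
        self.updates.clear()
        self.frozen = False


disk = VirtualMachineDisk({0: b"boot", 1: b"data-v1"})
disk.freeze()
disk.write(1, b"data-v2")   # lands in the update file, not the disk file
snap = disk.snapshot()      # still reflects the state at freeze time
disk.unfreeze()             # update file merged into the disk file
```

The snapshot keeps the state at the freeze point, while the write issued during the frozen period becomes visible in the disk file only after the merge.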

One embodiment of a storage appliance 300 (or storage appliance 102) includes a network interface 120, processor 122, memory 124, and disk 126 all in communication with each other. Network interface 120 allows storage appliance 300 to connect to one or more networks 128. Network interface 120 may include a wireless network interface and/or a wired network interface. Processor 122 allows storage appliance 300 to execute computer readable instructions stored in memory 124 in order to perform processes described herein. Processor 122 may include one or more processing units, such as one or more CPUs and/or one or more GPUs. Memory 124 may comprise one or more types of memory (e.g., RAM, SRAM, DRAM, ROM, EEPROM). Disk 126 may include a hard disk drive and/or a solid-state drive. Memory 124 and disk 126 may comprise hardware storage devices.

In one embodiment, the storage appliance 300 may include four machines. Each of the four machines may include a multi-core CPU, 64 GB of RAM, a 400 GB SSD, three 4 terabyte (TB) HDDs, and a network interface controller. In this case, the four machines may be in communication with the one or more networks 128 via the four network interface controllers. The four machines may comprise four nodes of a server cluster. The server cluster may comprise a set of physical machines that are connected together via a network. The server cluster may be used for storing data associated with a plurality of virtual machines, such as backup data associated with different point-in-time versions of the virtual machines.

The networked computing environment 100 may provide a cloud computing environment for one or more computing devices. Cloud computing may refer to Internet-based computing, wherein shared resources, software, and/or information may be provided to one or more computing devices on-demand via the Internet. The networked computing environment 100 may comprise a cloud computing environment providing Software-as-a-Service (SaaS) or Infrastructure-as-a-Service (IaaS) services. SaaS may refer to a software distribution model in which applications are hosted by a service provider and made available to end users over the Internet. In one embodiment, the networked computing environment 100 may include a virtualized infrastructure that provides software, data processing, and/or data storage services to end users accessing the services via the networked computing environment 100. In one example, networked computing environment 100 may provide cloud-based work productivity or business-related applications to a computing device, such as computing device 108. The storage appliance 102 may comprise a cloud-based data management system for backing up virtual machines and/or files within a virtualized infrastructure, such as virtual machines running on server 200 and/or files stored on server 200.

In some cases, networked computing environment 100 may provide remote access to secure applications and files stored within data center 106 from a remote computing device, such as computing device 108. The data center 106 may use an access control application to manage remote access to protected resources, such as protected applications, databases, or files located within the data center 106. To facilitate remote access to secure applications and files, a secure network connection may be established using a virtual private network (VPN). A VPN connection may allow a remote computing device, such as computing device 108, to securely access data from a private network (e.g., from a company file server or mail server) using an unsecure public network or the Internet. The VPN connection may require client-side software (e.g., running on the remote computing device) to establish and maintain the VPN connection. The VPN client software may provide data encryption and encapsulation prior to the transmission of secure private network traffic through the Internet.

In some embodiments, the storage appliance 300 may manage the extraction and storage of virtual machine snapshots associated with different point in time versions of one or more virtual machines running within the data center 106. A snapshot of a virtual machine may correspond with a state of the virtual machine at a particular point-in-time. In response to a restore command from the storage device 104, the storage appliance 300 may restore a point-in-time version of a virtual machine or restore point-in-time versions of one or more files located on the virtual machine and transmit the restored data to the server 200. In response to a mount command from the server 200, the storage appliance 300 may allow a point-in-time version of a virtual machine to be mounted and allow the server 200 to read and/or modify data associated with the point-in-time version of the virtual machine. To improve storage density, the storage appliance 300 may deduplicate and compress data associated with different versions of a virtual machine and/or deduplicate and compress data associated with different virtual machines. To improve system performance, the storage appliance 300 may first store virtual machine snapshots received from a virtualized environment in a cache, such as a flash-based cache. The cache may also store popular data or frequently accessed data (e.g., based on a history of virtual machine restorations, incremental files associated with commonly restored virtual machine versions) and current day incremental files or incremental files corresponding with snapshots captured within the past 24 hours.
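The disclosure does not say how deduplication and compression are carried out. One common approach, shown here purely as an assumed illustration, is to key each data block by a content hash so identical blocks are stored once, and to compress each stored block. The class `DedupStore` and its methods are hypothetical; `hashlib.sha256` and `zlib` are from the Python standard library.

```python
import hashlib
import zlib


class DedupStore:
    """Content-addressed block store: identical blocks are kept once, compressed."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> zlib-compressed block

    def put(self, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.chunks:      # store each distinct block only once
            self.chunks[digest] = zlib.compress(data)
        return digest

    def get(self, digest):
        return zlib.decompress(self.chunks[digest])


store = DedupStore()
block = b"A" * 4096
# Two versions of a virtual machine that share one identical block.
version1 = [store.put(block), store.put(b"config-v1")]
version2 = [store.put(block), store.put(b"config-v2")]
```

Here the shared block is stored a single time even though both versions reference it, so four logical blocks occupy three stored chunks.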

An incremental file may comprise a forward incremental file or a reverse incremental file. A forward incremental file may include a set of data representing changes that have occurred since an earlier point-in-time snapshot of a virtual machine. To generate a snapshot of the virtual machine corresponding with a forward incremental file, the forward incremental file may be combined with an earlier point in time snapshot of the virtual machine (e.g., the forward incremental file may be combined with the last full image of the virtual machine that was captured before the forward incremental file was captured and any other forward incremental files that were captured subsequent to the last full image and prior to the forward incremental file). A reverse incremental file may include a set of data representing changes from a later point-in-time snapshot of a virtual machine. To generate a snapshot of the virtual machine corresponding with a reverse incremental file, the reverse incremental file may be combined with a later point-in-time snapshot of the virtual machine (e.g., the reverse incremental file may be combined with the most recent snapshot of the virtual machine and any other reverse incremental files that were captured prior to the most recent snapshot and subsequent to the reverse incremental file).
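The two reconstruction directions can be illustrated with a small sketch. It assumes, for simplicity, that an image is a mapping from block index to contents and that an incremental file is a mapping holding only the blocks that differ; blocks are overwritten but never removed. The function `apply_incrementals` and the sample versions are hypothetical.

```python
def apply_incrementals(base_image, incrementals):
    """Overlay each incremental (block index -> contents) on a copy of the base image,
    in the order given."""
    image = dict(base_image)
    for incremental in incrementals:
        image.update(incremental)
    return image


# Three versions of the same virtual disk (hypothetical contents).
v1 = {0: "a", 1: "b"}
v2 = {0: "a", 1: "c"}
v3 = {0: "d", 1: "c"}

# Forward incrementals hold changes relative to the earlier version.
forward_v2 = {1: "c"}   # v1 -> v2
forward_v3 = {0: "d"}   # v2 -> v3

# Reverse incrementals hold changes relative to the later version.
reverse_v2 = {0: "a"}   # v3 -> v2
reverse_v1 = {1: "b"}   # v2 -> v1

# Forward: start at the last full image and apply incrementals oldest first.
rebuilt_v3 = apply_incrementals(v1, [forward_v2, forward_v3])

# Reverse: start at the most recent snapshot and apply incrementals newest first.
rebuilt_v2 = apply_incrementals(v3, [reverse_v2])
rebuilt_v1 = apply_incrementals(v3, [reverse_v2, reverse_v1])
```

With reverse incrementals the most recent snapshot is available without applying anything, while older versions require walking back through the chain.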

The storage appliance 300 may provide a user interface (e.g., a web-based interface or a graphical user interface) that displays virtual machine backup information such as identifications of the virtual machines protected and the historical versions or time machine views for each of the virtual machines protected. A time machine view of a virtual machine may include snapshots of the virtual machine over a plurality of points in time. Each snapshot may comprise the state of the virtual machine at a particular point in time. Each snapshot may correspond with a different version of the virtual machine (e.g., Version 1 of a virtual machine may correspond with the state of the virtual machine at a first point in time and Version 2 of the virtual machine may correspond with the state of the virtual machine at a second point in time subsequent to the first point in time).

The user interface may enable an end user of the storage appliance 300 (e.g., a system administrator or a virtualization administrator) to select a particular version of a virtual machine to be restored or mounted. When a particular version of a virtual machine has been mounted, the particular version may be accessed by a client (e.g., a virtual machine, a physical machine, or a computing device) as if the particular version was local to the client. A mounted version of a virtual machine may correspond with a mount point directory (e.g., /snapshots/VM5/Version23). In one example, the storage appliance 300 may run a network file system (NFS) server and make the particular version (or a copy of the particular version) of the virtual machine accessible for reading and/or writing. The end user of the storage appliance 300 may then select the particular version to be mounted and run an application (e.g., a data analytics application) using the mounted version of the virtual machine. In another example, the particular version may be mounted as an iSCSI target.

FIG. 2 depicts one embodiment of server 200 of FIG. 1. The server 200 may comprise one server out of a plurality of servers that are networked together within a data center (e.g., data center 106). In one example, the plurality of servers may be positioned within one or more server racks within the data center. As depicted, the server 200 includes hardware-level components and software-level components. The hardware-level components include one or more processors 202, one or more memory 204, and one or more disks 206. The software-level components include a hypervisor 208, a virtualized infrastructure manager 222, and one or more virtual machines, such as virtual machine 220. The hypervisor 208 may comprise a native hypervisor or a hosted hypervisor. The hypervisor 208 may provide a virtual operating platform for running one or more virtual machines, such as virtual machine 220. Virtual machine 220 includes a plurality of virtual hardware devices including a virtual processor 210, a virtual memory 212, and a virtual disk 214. The virtual disk 214 may comprise a file stored within the one or more disks 206. In one example, a virtual machine 220 may include a plurality of virtual disks 214, with each virtual disk of the plurality of virtual disks 214 associated with a different file stored on the one or more disks 206. Virtual machine 220 may include a guest operating system 216 that runs one or more applications, such as application 218.

The virtualized infrastructure manager 222, which may correspond with the virtualization manager 118 in FIG. 1, may run on a virtual machine or natively on the server 200. The virtual machine may, for example, be or include the virtual machine 220 or a virtual machine separate from the server 200. Other arrangements are possible. The virtualized infrastructure manager 222 may provide a centralized platform for managing a virtualized infrastructure that includes a plurality of virtual machines. The virtualized infrastructure manager 222 may manage the provisioning of virtual machines running within the virtualized infrastructure and provide an interface to computing devices interacting with the virtualized infrastructure. The virtualized infrastructure manager 222 may perform various virtualized infrastructure related tasks, such as cloning virtual machines, creating new virtual machines, monitoring the state of virtual machines, and facilitating backups of virtual machines.

In one embodiment, the server 200 may use the virtualized infrastructure manager 222 to facilitate backups for a plurality of virtual machines (e.g., eight different virtual machines) running on the server 200. Each virtual machine running on the server 200 may run its own guest operating system and its own set of applications. Each virtual machine running on the server 200 may store its own set of files using one or more virtual disks associated with the virtual machine (e.g., each virtual machine may include two virtual disks that are used for storing data associated with the virtual machine).

In one embodiment, a data management application running on a storage appliance, such as storage appliance 102 in FIG. 1 or storage appliance 300 in FIG. 1, may request a snapshot of a virtual machine running on server 200. The snapshot of the virtual machine may be stored as one or more files, with each file associated with a virtual disk of the virtual machine. A snapshot of a virtual machine may correspond with a state of the virtual machine at a particular point in time. The particular point in time may be associated with a time stamp. In one example, a first snapshot of a virtual machine may correspond with a first state of the virtual machine (including the state of applications and files stored on the virtual machine) at a first point in time and a second snapshot of the virtual machine may correspond with a second state of the virtual machine at a second point in time subsequent to the first point in time.

In response to a request for a snapshot of a virtual machine at a particular point in time, the virtualized infrastructure manager 222 may set the virtual machine into a frozen state or store a copy of the virtual machine at the particular point in time. The virtualized infrastructure manager 222 may then transfer data associated with the virtual machine (e.g., an image of the virtual machine or a portion of the image of the virtual machine) to the storage appliance 300 or storage appliance 102. The data associated with the virtual machine may include a set of files including a virtual disk file storing contents of a virtual disk of the virtual machine at the particular point in time and a virtual machine configuration file storing configuration settings for the virtual machine at the particular point in time. The contents of the virtual disk file may include the operating system used by the virtual machine, local applications stored on the virtual disk, and user files (e.g., images and word processing documents). In some cases, the virtualized infrastructure manager 222 may transfer a full image of the virtual machine to the storage appliance 102 or storage appliance 300 of FIG. 1 or a plurality of data blocks corresponding with the full image (e.g., to enable a full image-level backup of the virtual machine to be stored on the storage appliance). In other cases, the virtualized infrastructure manager 222 may transfer a portion of an image of the virtual machine associated with data that has changed since an earlier point in time prior to the particular point in time or since a last snapshot of the virtual machine was taken. In one example, the virtualized infrastructure manager 222 may transfer only data associated with virtual blocks stored on a virtual disk of the virtual machine that have changed since the last snapshot of the virtual machine was taken. In one embodiment, the data management application may specify a first point in time and a second point in time and the virtualized infrastructure manager 222 may output one or more virtual data blocks associated with the virtual machine that have been modified between the first point in time and the second point in time.
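The query for blocks modified between two points in time can be sketched as below. The sketch assumes that each write to a virtual block is recorded as a (timestamp, block index) pair; that write log, the function `changed_blocks`, and the choice of a half-open interval are illustrative assumptions rather than details given in the disclosure.

```python
def changed_blocks(write_log, first_time, second_time):
    """Return the sorted indices of blocks written after `first_time`
    and at or before `second_time`."""
    return sorted(
        {index for timestamp, index in write_log if first_time < timestamp <= second_time}
    )


# Hypothetical write log: (timestamp, virtual block index)
write_log = [(5, 0), (12, 3), (15, 3), (20, 7), (31, 1)]
modified = changed_blocks(write_log, first_time=10, second_time=30)
print(modified)
```

Block 3 was written twice in the window but is reported once, since only the set of changed blocks needs to be transferred.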

In some embodiments, the server 200 or the hypervisor 208 may communicate with a storage appliance, such as storage appliance 102 in FIG. 1 or storage appliance 300 in FIG. 1, using a distributed file system protocol such as NFS Version 3 or the Server Message Block (SMB) protocol. The distributed file system protocol may allow the server 200 or the hypervisor 208 to access, read, write, or modify files stored on the storage appliance as if the files were locally stored on the server 200. The distributed file system protocol may allow the server 200 or the hypervisor 208 to mount a directory or a portion of a file system located within the storage appliance.

FIG. 3 depicts one embodiment of storage appliance 300 in FIG. 1. The storage appliance may include a plurality of physical machines that may be grouped together and presented as a single computing system. Each physical machine of the plurality of physical machines may comprise a node in a cluster (e.g., a failover cluster). In one example, the storage appliance may be positioned within a server rack within a data center. As depicted, the storage appliance 300 includes hardware-level components and software-level components. The hardware-level components include one or more physical machines, such as physical machine 314 and physical machine 324. The physical machine 314 includes a network interface 316, processor 318, memory 320, and disk 322 all in communication with each other. Processor 318 allows physical machine 314 to execute computer readable instructions stored in memory 320 to perform processes described herein. Disk 322 may include a hard disk drive and/or a solid-state drive. The physical machine 324 includes a network interface 326, processor 328, memory 330, and disk 332 all in communication with each other. Processor 328 allows physical machine 324 to execute computer readable instructions stored in memory 330 to perform processes described herein. Disk 332 may include a hard disk drive and/or a solid-state drive. In some cases, disk 332 may include a flash-based SSD or a hybrid HDD/SSD drive. In one embodiment, the storage appliance 300 may include a plurality of physical machines arranged in a cluster (e.g., eight machines in a cluster). Each of the plurality of physical machines may include a plurality of multi-core CPUs, 108 GB of RAM, a 500 GB SSD, four 4 TB HDDs, and a network interface controller.

In some embodiments, the plurality of physical machines may be used toimplement a cluster-based network fileserver. The cluster-based networkfile server may neither require nor use a front-end load balancer. Oneissue with using a front-end load balancer to host the internet protocol(IP) address for the cluster-based network file server and to forwardrequests to the nodes of the cluster-based network file server is thatthe front-end load balancer comprises a single point of failure for thecluster-based network file server. In some cases, the file systemprotocol used by a server, such as server 200 in FIG. 1 , or ahypervisor, such as hypervisor 208 in FIG. 2 , to communicate with thestorage appliance 300 may not provide a failover mechanism (e.g., NFSVersion 3). In the case that no failover mechanism is provided on theclient side, the hypervisor may not be able to connect to a new nodewithin a cluster in the event that the node connected to the hypervisorfails.

In some embodiments, each node in a cluster may be connected to eachother via a network and may be associated with one or more IP addresses(e.g., two different IP addresses may be assigned to each node). In oneexample, each node in the cluster may be assigned a permanent IP addressand a floating IP address and may be accessed using either the permanentIP address or the floating IP address. In this case, a hypervisor, suchas hypervisor 208 in FIG. 2 , may be configured with a first floating IPaddress associated with a first node in the cluster. The hypervisor mayconnect to the cluster using the first floating IP address. In oneexample, the hypervisor may communicate with the cluster using the NFSVersion 3 protocol. Each node in the cluster may run a Virtual RouterRedundancy Protocol (VRRP) daemon. A daemon may comprise a backgroundprocess. Each VRRP daemon may include a list of all floating IPaddresses available within the cluster. In the event that the first nodeassociated with the first floating IP address fails, one of the VRRPdaemons may automatically assume or pick up the first floating IPaddress if no other VRRP daemon has already assumed the first floatingIP address. Therefore, if the first node in the cluster fails orotherwise goes down, then one of the remaining VRRP daemons running onthe other nodes in the cluster may assume the first floating IP addressthat is used by the hypervisor for communicating with the cluster.

In order to determine which of the other nodes in the cluster will assume the first floating IP address, a VRRP priority may be established. In one example, given a number (N) of nodes in a cluster from node(0) to node(N−1), for a floating IP address (i), the VRRP priority of node(j) may be (j-i) modulo N. In another example, given a number (N) of nodes in a cluster from node(0) to node(N−1), for a floating IP address (i), the VRRP priority of node(j) may be (i-j) modulo N. In these cases, node(j) will assume floating IP address (i) only if its VRRP priority is higher than that of any other node in the cluster that is alive and announcing itself on the network. Thus, if a node fails, then there may be a clear priority ordering for determining which other node in the cluster will take over the failed node's floating IP address.
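The priority rule described above can be sketched in a few lines. This is a minimal illustration of the first formula, (j-i) modulo N, and not an implementation of a VRRP daemon; the function names `vrrp_priority` and `takeover_node` and the representation of alive nodes as a list of indices are assumptions made for this example.

```python
from typing import Iterable


def vrrp_priority(node: int, floating_ip: int, num_nodes: int) -> int:
    """Priority of node(j) for floating IP (i), using (j - i) modulo N."""
    return (node - floating_ip) % num_nodes


def takeover_node(floating_ip: int, alive_nodes: Iterable[int], num_nodes: int) -> int:
    """Return the alive node with the highest priority for the floating IP.

    Priorities are distinct for distinct node indices below N, so the
    maximum is unique and no tie-break is needed.
    """
    return max(alive_nodes, key=lambda j: vrrp_priority(j, floating_ip, num_nodes))


# Four-node cluster, floating IP index 1, node 1 has failed.
print(takeover_node(1, [0, 2, 3], num_nodes=4))
```

Swapping the body of `vrrp_priority` to `(floating_ip - node) % num_nodes` gives the second formula described above.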

In some cases, a cluster may include a plurality of nodes and each nodeof the plurality of nodes may be assigned a different floating IPaddress. In this case, a first hypervisor may be configured with a firstfloating IP address associated with a first node in the cluster, asecond hypervisor may be configured with a second floating IP addressassociated with a second node in the cluster, and a third hypervisor maybe configured with a third floating IP address associated with a thirdnode in the cluster.

As depicted in FIG. 3 , the software-level components of the storageappliance 300 may include data management system 302, a virtualizationinterface 304, a distributed job scheduler 308, a distributed metadatastore 310, a distributed file system 312, and one or more virtualmachine search indexes, such as virtual machine search index 306. In oneembodiment, the software-level components of the storage appliance 300may be run using a dedicated hardware-based appliance. In anotherembodiment, the software-level components of the storage appliance 300may be run from the cloud (e.g., the software-level components may beinstalled on a cloud service provider).

In some cases, the data storage across a plurality of nodes in a cluster(e.g., the data storage available from the one or more physical machine(e.g., physical machine 314 and physical machine 324)) may be aggregatedand made available over a single file system namespace (e.g.,/snapshots/). A directory for each virtual machine protected using thestorage appliance 300 may be created (e.g., the directory for VirtualMachine A may be /snapshots/VM_A). Snapshots and other data associatedwith a virtual machine may reside within the directory for the virtualmachine. In one example, snapshots of a virtual machine may be stored insubdirectories of the directory (e.g., a first snapshot of VirtualMachine A may reside in/snapshots/VM_A/s1/ and a second snapshot ofVirtual Machine A may reside in /snapshots/VM_A/s2/).

The distributed file system 312 may present itself as a single filesystem, in which as new physical machines or nodes are added to thestorage appliance 300, the cluster may automatically discover theadditional nodes and automatically increase the available capacity ofthe file system for storing files and other data. Each file stored inthe distributed file system 312 may be partitioned into one or morechunks or shards. Each of the one or more chunks may be stored withinthe distributed file system 312 as a separate file. The files storedwithin the distributed file system 312 may be replicated or mirroredover a plurality of physical machines, thereby creating a load-balancedand fault tolerant distributed file system. In one example, storageappliance 300 may include ten physical machines arranged as a failovercluster and a first file corresponding with a snapshot of a virtualmachine (e.g., /snapshots/VM_A/s1/s1.full) may be replicated and storedon three of the ten machines.

The distributed metadata store 310 may include a distributed databasemanagement system that provides high availability without a single pointof failure. In one embodiment, the distributed metadata store 310 maycomprise a database, such as a distributed document-oriented database.The distributed metadata store 310 may be used as a distributed keyvalue storage system. In one example, the distributed metadata store 310may comprise a distributed NoSQL key value store database. In somecases, the distributed metadata store 310 may include a partitioned rowstore, in which rows are organized into tables or other collections ofrelated data held within a structured format within the key value storedatabase. A table (or a set of tables) may be used to store metadatainformation associated with one or more files stored within thedistributed file system 312. The metadata information may include thename of a file, a size of the file, file permissions associated with thefile, when the file was last modified, and file mapping informationassociated with an identification of the location of the file storedwithin a cluster of physical machines. In one embodiment, a new filecorresponding with a snapshot of a virtual machine may be stored withinthe distributed file system 312 and metadata associated with the newfile may be stored within the distributed metadata store 310. Thedistributed metadata store 310 may also be used to store a backupschedule for the virtual machine and a list of snapshots for the virtualmachine that are stored using the storage appliance 300.

In some cases, the distributed metadata store 310 may be used to manageone or more versions of a virtual machine. Each version of the virtualmachine may correspond with a full image snapshot of the virtual machinestored within the distributed file system 312 or an incremental snapshotof the virtual machine (e.g., a forward incremental or reverseincremental) stored within the distributed file system 312. In oneembodiment, the one or more versions of the virtual machine maycorrespond with a plurality of files. The plurality of files may includea single full image snapshot of the virtual machine and one or moreincremental aspects derived from the single full image snapshot. Thesingle full image snapshot of the virtual machine may be stored using afirst storage device of a first type (e.g., a HDD) and the one or moreincremental aspects derived from the single full image snapshot may bestored using a second storage device of a second type (e.g., an SSD). Inthis case, only a single full image needs to be stored and each versionof the virtual machine may be generated from the single full image orthe single full image combined with a subset of the one or moreincremental aspects. Furthermore, each version of the virtual machinemay be generated by performing a sequential read from the first storagedevice (e.g., reading a single file from a HDD) to acquire the fullimage and, in parallel, performing one or more reads from the secondstorage device (e.g., performing fast random reads from an SSD) toacquire the one or more incremental aspects.

The distributed job scheduler 308 may be used for scheduling backup jobsthat acquire and store virtual machine snapshots for one or more virtualmachines over time. The distributed job scheduler 308 may follow abackup schedule to back up an entire image of a virtual machine at aparticular point in time or one or more virtual disks associated withthe virtual machine at the particular point in time. In one example, thebackup schedule may specify that the virtual machine be backed up at asnapshot capture frequency, such as every two hours or every 24 hours.Each backup job may be associated with one or more tasks to be performedin a sequence. Each of the one or more tasks associated with a job maybe run on a particular node within a cluster. In some cases, thedistributed job scheduler 308 may schedule a specific job to be run on aparticular node based on data stored on the particular node. Forexample, the distributed job scheduler 308 may schedule a virtualmachine snapshot job to be run on a node in a cluster that is used tostore snapshots of the virtual machine in order to reduce networkcongestion.

The distributed job scheduler 308 may comprise a distributed faulttolerant job scheduler, in which jobs affected by node failures arerecovered and rescheduled to be run on available nodes. In oneembodiment, the distributed job scheduler 308 may be fully decentralizedand implemented without the existence of a master node. The distributedjob scheduler 308 may run job scheduling processes on each node in acluster or on a plurality of nodes in the cluster. In one example, thedistributed job scheduler 308 may run a first set of job schedulingprocesses on a first node in the cluster, a second set of job schedulingprocesses on a second node in the cluster, and a third set of jobscheduling processes on a third node in the cluster. The first set ofjob scheduling processes, the second set of job scheduling processes,and the third set of job scheduling processes may store informationregarding jobs, schedules, and the states of jobs using a metadatastore, such as distributed metadata store 310. In the event that thefirst node running the first set of job scheduling processes fails(e.g., due to a network failure or a physical machine failure), thestates of the jobs managed by the first set of job scheduling processesmay fail to be updated within a threshold period of time (e.g., a jobmay fail to be completed within 30 seconds or within minutes from beingstarted). In response to detecting jobs that have failed to be updatedwithin the threshold period of time, the distributed job scheduler 308may undo and restart the failed jobs on available nodes within thecluster.

The job scheduling processes running on at least a plurality of nodes ina cluster (e.g., on each available node in the cluster) may manage thescheduling and execution of a plurality of jobs. The job schedulingprocesses may include run processes for running jobs, cleanup processesfor cleaning up failed tasks, and rollback processes for rolling-back orundoing any actions or tasks performed by failed jobs. In oneembodiment, the job scheduling processes may detect that a particulartask for a particular job has failed and in response may perform acleanup process to clean up or remove the effects of the particular taskand then perform a rollback process that processes one or more completedtasks for the particular job in reverse order to undo the effects of theone or more completed tasks. Once the particular job with the failedtask has been undone, the job scheduling processes may restart theparticular job on an available node in the cluster.
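The cleanup-then-rollback sequence described above may be illustrated as follows. This is a simplified single-process sketch; the `Task` structure, its `run`, `cleanup`, and `undo` callables, and the `run_job` function are hypothetical names introduced for this example and are not part of the distributed job scheduler 308.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    run: Callable[[], None]      # performs the task
    cleanup: Callable[[], None]  # removes partial effects if run() fails
    undo: Callable[[], None]     # reverses a task that completed


def run_job(tasks: List[Task]) -> bool:
    """Run tasks in order; on failure, clean up and roll back in reverse."""
    completed: List[Task] = []
    for task in tasks:
        try:
            task.run()
        except Exception:
            # Clean up the failed task first, then undo completed tasks
            # starting with the most recently completed one.
            task.cleanup()
            for done in reversed(completed):
                done.undo()
            return False
        completed.append(task)
    return True
```

A caller that receives `False` could then resubmit the job on another available node, in line with the restart behavior described above.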

The distributed job scheduler 308 may manage a job in which a series oftasks associated with the job are to be performed atomically (i.e.,partial execution of the series of tasks is not permitted). If theseries of tasks cannot be completely executed or there is any failurethat occurs to one of the series of tasks during execution (e.g., a harddisk associated with a physical machine fails or a network connection tothe physical machine fails), then the state of a data management systemmay be returned to a state as if none of the series of tasks was everperformed. The series of tasks may correspond with an ordering of tasksfor the series of tasks and the distributed job scheduler 308 may ensurethat each task of the series of tasks is executed based on the orderingof tasks. Tasks that do not have dependencies with each other may beexecuted in parallel.

In some cases, the distributed job scheduler 308 may schedule each taskof a series of tasks to be performed on a specific node in a cluster. Inother cases, the distributed job scheduler 308 may schedule a first taskof the series of tasks to be performed on a first node in a cluster anda second task of the series of tasks to be performed on a second node inthe cluster. In these cases, the first task may have to operate on afirst set of data (e.g., a first file stored in a file system) stored onthe first node and the second task may have to operate on a second setof data (e.g., metadata related to the first file that is stored in adatabase) stored on the second node. In some embodiments, one or moretasks associated with a job may have an affinity to a specific node in acluster.

In one example, if the one or more tasks require access to a databasethat has been replicated on three nodes in a cluster, then the one ormore tasks may be executed on one of the three nodes. In anotherexample, if the one or more tasks require access to multiple chunks ofdata associated with a virtual disk that has been replicated over fournodes in a cluster, then the one or more tasks may be executed on one ofthe four nodes. Thus, the distributed job scheduler 308 may assign oneor more tasks associated with a job to be executed on a particular nodein a cluster based on the location of data required to be accessed bythe one or more tasks.
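A data-locality placement decision of this kind might be sketched as below. The text above only states that a task may run on one of the nodes holding the required data; choosing the least-loaded replica node is an assumption added for this example, as are the `pick_node` name and the `load` mapping.

```python
from typing import Dict, Iterable, Optional, Set


def pick_node(replica_nodes: Iterable[str],
              alive_nodes: Set[str],
              load: Dict[str, int]) -> Optional[str]:
    """Pick an alive node that holds a replica of the required data.

    Among candidates, prefer the node with the fewest running tasks
    (an assumed tie-break policy); return None if no replica node is alive.
    """
    candidates = [n for n in replica_nodes if n in alive_nodes]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (load.get(n, 0), n))


# Data replicated on three nodes; node n1 is currently down.
print(pick_node(["n1", "n2", "n3"], {"n2", "n3", "n4"}, {"n2": 5, "n3": 1}))
```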

In one embodiment, the distributed job scheduler 308 may manage a first job associated with capturing and storing a snapshot of a virtual machine periodically (e.g., every 30 minutes). The first job may include one or more tasks, such as communicating with a virtualized infrastructure manager, such as the virtualized infrastructure manager 222 in FIG. 2, to create a frozen copy of the virtual machine and to transfer one or more chunks (or one or more files) associated with the frozen copy to a storage appliance, such as storage appliance 300 in FIG. 1. The one or more tasks may also include generating metadata for the one or more chunks, storing the metadata using the distributed metadata store 310, storing the one or more chunks within the distributed file system 312, and communicating with the virtualized infrastructure manager 222 that the frozen copy of the virtual machine may be unfrozen or released from a frozen state. The metadata for a first chunk of the one or more chunks may include information specifying a version of the virtual machine associated with the frozen copy, a time associated with the version (e.g., the snapshot of the virtual machine was taken at 5:30 p.m. on Jun. 29, 2018), and a file path to where the first chunk is stored within the distributed file system 312 (e.g., the first chunk is located at /snapshots/VM_B/s1/s1.chunk1). The one or more tasks may also include deduplication, compression (e.g., using a lossless data compression algorithm such as LZ4 or LZ77), decompression, encryption (e.g., using a symmetric key algorithm such as Triple Data Encryption Algorithm (Triple DES) or Advanced Encryption Standard (AES)-256), and decryption related tasks.

The virtualization interface 304 may provide an interface forcommunicating with a virtualized infrastructure manager managing avirtualization infrastructure, such as virtualized infrastructuremanager 222 in FIG. 2 , and requesting data associated with virtualmachine snapshots from the virtualization infrastructure. Thevirtualization interface 304 may communicate with the virtualizedinfrastructure manager using an Application Programming Interface (API)for accessing the virtualized infrastructure manager (e.g., tocommunicate a request for a snapshot of a virtual machine). In thiscase, storage appliance 300 may request and receive data from avirtualized infrastructure without requiring agent software to beinstalled or running on virtual machines within the virtualizedinfrastructure. The virtualization interface 304 may request dataassociated with virtual blocks stored on a virtual disk of the virtualmachine that have changed since a last snapshot of the virtual machinewas taken or since a specified prior point in time. Therefore, in somecases, if a snapshot of a virtual machine is the first snapshot taken ofthe virtual machine, then a full image of the virtual machine may betransferred to the storage appliance. However, if the snapshot of thevirtual machine is not the first snapshot taken of the virtual machine,then only the data blocks of the virtual machine that have changed sincea prior snapshot was taken may be transferred to the storage appliance.

The virtual machine search index 306 may include a list of files thathave been stored using a virtual machine and a version history for eachof the files in the list. Each version of a file may be mapped to theearliest point-in-time snapshot of the virtual machine that includes theversion of the file or to a snapshot of the virtual machine thatincludes the version of the file (e.g., the latest point in timesnapshot of the virtual machine that includes the version of the file).In one example, the virtual machine search index 306 may be used toidentify a version of the virtual machine that includes a particularversion of a file (e.g., a particular version of a database, aspreadsheet, or a word processing document). In some cases, each of thevirtual machines that are backed up or protected using storage appliance300 may have a corresponding virtual machine search index.

In one embodiment, as each snapshot of a virtual machine is ingested,each virtual disk associated with the virtual machine is parsed in orderto identify a file system type associated with the virtual disk and toextract metadata (e.g., file system metadata) for each file stored onthe virtual disk. The metadata may include information for locating andretrieving each file from the virtual disk. The metadata may alsoinclude a name of a file, the size of the file, the last time at whichthe file was modified, and a content checksum for the file. Each filethat has been added, deleted, or modified since a previous snapshot wascaptured may be determined using the metadata (e.g., by comparing thetime at which a file was last modified with a time associated with theprevious snapshot). Thus, for every file that has existed within any ofthe snapshots of the virtual machine, a virtual machine search index maybe used to identify when the file was first created (e.g., correspondingwith a first version of the file) and at what times the file wasmodified (e.g., corresponding with subsequent versions of the file).Each version of the file may be mapped to a particular version of thevirtual machine that stores that version of the file.
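The comparison of file metadata against the previous snapshot can be sketched as follows. This is an illustrative simplification that uses only the last-modified time, as in the example above; the `FileMeta` record, the `diff_snapshot` function, and the in-memory dictionaries are assumptions for this sketch and do not reflect how the virtual machine search index 306 is actually stored.

```python
from typing import Dict, NamedTuple, Set, Tuple


class FileMeta(NamedTuple):
    size: int
    mtime: float     # last-modified time, seconds since the epoch
    checksum: str


def diff_snapshot(prev_paths: Set[str],
                  curr: Dict[str, FileMeta],
                  prev_snapshot_time: float) -> Tuple[Set[str], Set[str], Set[str]]:
    """Classify files as added, deleted, or modified since the previous snapshot."""
    added = set(curr) - prev_paths
    deleted = prev_paths - set(curr)
    # A file present in both snapshots counts as modified when its
    # last-modified time is later than the previous snapshot's time.
    modified = {p for p in prev_paths & set(curr)
                if curr[p].mtime > prev_snapshot_time}
    return added, deleted, modified
```

Each path in `added` or `modified` would correspond to a new file version mapped to the snapshot being ingested.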

In some cases, if a virtual machine includes a plurality of virtualdisks, then a virtual machine search index may be generated for eachvirtual disk of the plurality of virtual disks. For example, a firstvirtual machine search index may catalog and map files located on afirst virtual disk of the plurality of virtual disks and a secondvirtual machine search index may catalog and map files located on asecond virtual disk of the plurality of virtual disks. In this case, aglobal file catalog or a global virtual machine search index for thevirtual machine may include the first virtual machine search index andthe second virtual machine search index. A global file catalog may bestored for each virtual machine backed up by a storage appliance withina file system, such as distributed file system 312 in FIG. 3 .

The data management system 302 may comprise an application running onthe storage appliance 300 that manages and stores one or more snapshotsof a virtual machine. In one example, the data management system 302 maycomprise a highest-level layer in an integrated software stack runningon the storage appliance. The integrated software stack may include thedata management system 302, the virtualization interface 304, thedistributed job scheduler 308, the distributed metadata store 310, andthe distributed file system 312.

In some cases, the integrated software stack may run on other computingdevices, such as a server or computing device 108 in FIG. 1 . The datamanagement system 302 may use the virtualization interface 304, thedistributed job scheduler 308, the distributed metadata store 310, andthe distributed file system 312 to manage and store one or moresnapshots of a virtual machine. Each snapshot of the virtual machine maycorrespond with a point-in-time version of the virtual machine. The datamanagement system 302 may generate and manage a list of versions for thevirtual machine. Each version of the virtual machine may map to orreference one or more chunks and/or one or more files stored within thedistributed file system 312. Combined together, the one or more chunksand/or the one or more files stored within the distributed file system312 may comprise a full image of the version of the virtual machine.

FIG. 4 shows an example cluster 400 of a distributed decentralizeddatabase, according to some example embodiments. As illustrated, theexample cluster 400 includes five nodes, nodes 1-5. In some exampleembodiments, each of the five nodes runs from different machines, suchas physical machine 314 in FIG. 3 or virtual machine 220 in FIG. 2 . Thenodes in the example cluster 400 can include instances of peer nodes ofa distributed database (e.g., cluster-based database, distributeddecentralized database management system, a NoSQL database, ApacheCassandra, DataStax, MongoDB, CouchDB), according to some exampleembodiments. The distributed database system is distributed in that datais sharded or distributed across the example cluster 400 in shards orchunks and decentralized in that there is no central storage device andno single point of failure. The system operates under an assumption thatmultiple nodes may go down, up, or become non-responsive, and so on.Sharding is splitting up of the data horizontally and managing eachshard separately on different nodes. For example, if the data managed bythe example cluster 400 can be indexed using the 26 letters of thealphabet, node 1 can manage a first shard that handles records thatstart with A through E, node 2 can manage a second shard that handlesrecords that start with F through J, and so on.
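The alphabet-based sharding example above can be expressed as a small routing function. This is a sketch under stated assumptions: the text specifies only that node 1 handles A through E and node 2 handles F through J, "and so on", so assigning the leftover letter Z to node 5 is an assumption, as is the `shard_for` name.

```python
def shard_for(key: str, num_nodes: int = 5, letters_per_shard: int = 5) -> int:
    """Return the 1-based node number that manages records starting with key[0]."""
    index = ord(key[0].upper()) - ord("A")
    if not 0 <= index < 26:
        raise ValueError("key must start with a letter A-Z")
    # Clamp so that any leftover letters fall on the last node (assumed policy).
    return min(index // letters_per_shard, num_nodes - 1) + 1


for name in ("alice", "frank", "zoe"):
    print(name, "-> node", shard_for(name))
```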

In some example embodiments, data written to one of the nodes isreplicated to one or more other nodes per a replication protocol of theexample cluster 400. For example, data written to node 1 can bereplicated to nodes 2 and 3. If node 1 prematurely terminates, node 2and/or 3 can be used to provide the replicated data. In some exampleembodiments, each node of example cluster 400 frequently exchanges stateinformation about itself and other nodes across the example cluster 400using gossip protocol. Gossip protocol is a peer-to-peer communicationprotocol in which each node randomly shares (e.g., communicates,requests, transmits) location and state information about the othernodes in a given cluster.

Writing: For a given node, a sequentially written commit log capturesthe write activity to ensure data durability. The data is then writtento an in-memory structure (e.g., a memtable, write-back cache). Eachtime the in-memory structure is full, the data is written to disk in aSorted String Table data file. In some example embodiments, writes areautomatically partitioned and replicated throughout the example cluster400.
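The write path described above (commit log, in-memory structure, sorted on-disk file) can be illustrated with a toy single-node model. This is a sketch only: the `ToyNode` class, the list-based "disk" tables, and the small `memtable_limit` are assumptions for illustration, and partitioning and replication across the example cluster 400 are omitted.

```python
from typing import Dict, List, Optional, Tuple


class ToyNode:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.commit_log: List[Tuple[str, str]] = []      # sequential, append-only
        self.memtable: Dict[str, str] = {}               # in-memory structure
        self.sstables: List[List[Tuple[str, str]]] = []  # sorted, immutable tables
        self.memtable_limit = memtable_limit

    def write(self, key: str, value: str) -> None:
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            # Flush the full memtable as a table sorted by key.
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}

    def read(self, key: str) -> Optional[str]:
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table first
            for k, v in table:
                if k == key:
                    return v
        return None
```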

Reading: Any node of example cluster 400 can receive a read request(e.g., query) from an external client. If the node that receives theread request manages the data requested, the node provides the requesteddata. If the node does not manage the data, the node determines whichnode manages the requested data. The node that received the read requestthen acts as a proxy between the requesting entity and the node thatmanages the data (e.g., the node that manages the data sends the data tothe proxy node, which then provides the data to an external entity thatgenerated the request).

The distributed decentralized database system is decentralized in thatthere is no single point of failure due to the nodes being symmetricaland seamlessly replaceable. For example, whereas conventionaldistributed data implementations have nodes with different functions(e.g., master/slave nodes, asymmetrical database nodes, federateddatabases), the nodes of example cluster 400 are configured to functionthe same way (e.g., as symmetrical peer database nodes that communicatevia gossip protocol, such as Cassandra nodes) with no single point offailure. If one of the nodes in example cluster 400 terminatesprematurely (“goes down”), another node can rapidly take the place ofthe terminated node without disrupting service. The example cluster 400can be a container for a keyspace, which is a container for data in thedistributed decentralized database system (e.g., whereas a database is acontainer for containers in conventional relational databases, theCassandra keyspace is a container for a Cassandra database system).

Some examples provide observability for operational activity (including telemetric data) of all nodes in the example cluster 400, and/or in a group of example clusters 400, and/or in the cloud. The telemetry generated may exist in various forms such as metrics, logs, traces, and structured product data. Some examples herein focus on the collection and observability of metrics data. A significant challenge arising in conventional systems can be an inability to provide full observability for all metrics collected. For example, metrics may be collected at about 70 kilobytes (KB) per node but rendered visible at only 1 KB per node. In other words, 98.6% of metrics are dropped, leading to a gap in visibility. In other examples, metrics may be collected at a one-minute granularity but aggregated over 10 minutes, for example. Here, 90% of reported data is lost. In other examples, metrics data may be ephemeral. In order to identify a cause of failure, most issue investigations need to drill down to a specific job, snapshot ID, or mailbox. However, this set of data can become extremely large and break in-memory indexes. If this occurs, there is no support for low-level metrics on a node or in the cloud. In other examples, from a data points per minute (DPM) reporting perspective, 99.85% of metrics data may be lost. In order to collect all metric data and render it visible while keeping operational budgets constant, examples may reduce costs significantly. A primary roadblock to supporting full data is cost, which is directly related to DPM and the number of time series (cardinality). Thus, some examples herein recognize a business need for a large volume of metrics and a desire to pay less for them, while deriving and supporting rich tags in a tag-based schema. In some examples, cost savings are achieved by using application-aware compression of metrics; in some examples at their source (through a library, for example) and, if not in the cloud, by using sparseness estimators.

In some examples of an anomaly database in telemetry systems and applications, the scale and cost of the database are driven by two important measures: DPM (data points generated per minute) and a cardinality of series (relating to the metadata associated with data that needs to be indexed for powering read queries). In this specification, an anomaly database may also be referred to as a time series database (TSDB), or an anomaly database system 1402 (FIG. 14), or a component thereof, depending on the context in which this element is described. In other words, handling large time series datasets incurs two key challenges: (a) a data storage problem, i.e., how to store so much data cost effectively, and (b) an indexing problem, i.e., how to index so many series cost effectively. In seeking to meet significant cost-cutting goals, some examples reduce DPM by an order of magnitude. To that end, some examples employ “sparse data.” In a general sense, some examples identify “normal-looking” data and compress this to save billions of metrics and storage space. “Abnormal-looking” data is examined, but such data may be harder to detect.

A data problem that the sparse data seeks to address can arise when a time series (under a service level agreement (SLA), for example) is required to be reported by a database at a fixed interval regardless of whether the relevant value is changing or not. For example, if a service consistently uses 1 megabyte (MB) of RAM and reports memory used every 1 minute, it may post the same data point, just with a newer epoch (cardinality). Some databases may allow a user to configure a reporting interval; however, if a user chooses a large interval, precision is lost, and if the user chooses a small interval, then cost increases. Thus, some sparse data examples dynamically tune this interval algorithmically on a per-data-point level to optimize for lowest cost while maximizing data fidelity. In other words, data points that do not contribute any significant information beyond what is already known are dropped.

To illustrate, with reference to the graph 502 of FIG. 5, assume an example time series (synthetically generated using a random walk) is generated at 1-minute intervals. This reflects the data generation from a continuous data management (CDM) node (for example, in an example cluster 400). On the cloud, this series may be aggregated to a 10-minute interval to reduce the data points (cost perspective). As is evident, significant information may be lost as part of this aggregation.

With reference to the graph 602 of FIG. 6, when a sparseness algorithm (based on a last value delta) is applied, the illustrated series is generated. It will be noted that the sparseness approach more accurately captures the significant variations compared to the fixed interval aggregation.
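A last-value-delta filter of the kind referenced above might look like the following minimal sketch. It assumes a fixed absolute threshold and (timestamp, value) tuples; the `sparsify` and `reconstruct` names and the threshold semantics are assumptions for illustration, and the per-data-point tuning of the interval described elsewhere in this specification is not modeled.

```python
from typing import List, Tuple

Point = Tuple[int, float]  # (timestamp in minutes, value)


def sparsify(points: List[Point], threshold: float) -> List[Point]:
    """Keep a point only if it differs from the last kept value by >= threshold."""
    kept: List[Point] = []
    for ts, value in points:
        if not kept or abs(value - kept[-1][1]) >= threshold:
            kept.append((ts, value))
    return kept


def reconstruct(kept: List[Point], timestamps: List[int]) -> List[float]:
    """Estimate each timestamp's value by holding the last kept value.

    Assumes the first kept point is at or before the first timestamp.
    """
    out: List[float] = []
    i = 0
    for ts in timestamps:
        while i + 1 < len(kept) and kept[i + 1][0] <= ts:
            i += 1
        out.append(kept[i][1])
    return out


series = [(0, 10.0), (1, 10.2), (2, 10.4), (3, 12.0), (4, 12.1)]
print(sparsify(series, threshold=1.0))
```

With this rule, the error at any dropped point is bounded by the threshold, because a point is dropped only when it lies within the threshold of the value that will later stand in for it.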

With reference to FIG. 7 , the chart 702 and chart 704 show a meansquared error (MSE) for the original and sparse estimators of FIG. 5 andFIG. 6 and a reduction in data points (e.g., DPM) achieved. For such aseries, the sparseness algorithm delivers a similar reduction in datapoints but at a far higher accuracy.
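The two quantities compared in these charts, mean squared error and reduction in data points, can be computed as in the sketch below. The hold-last-value reconstruction, the function names, and the tiny hand-made series are assumptions for illustration; they are not the data behind FIG. 5 through FIG. 7.

```python
from typing import List, Sequence


def hold_reconstruct(values: Sequence[float], kept_idx: Sequence[int]) -> List[float]:
    """Rebuild a full-length series by holding the most recent kept value."""
    kept = set(kept_idx)
    if 0 not in kept:
        raise ValueError("the first point must be kept")
    out: List[float] = []
    current = values[0]
    for i, v in enumerate(values):
        if i in kept:
            current = v
        out.append(current)
    return out


def mse(original: Sequence[float], estimate: Sequence[float]) -> float:
    """Mean squared error between the original and reconstructed series."""
    return sum((a - b) ** 2 for a, b in zip(original, estimate)) / len(original)


def reduction(total_points: int, kept_points: int) -> float:
    """Fraction of data points removed."""
    return 1.0 - kept_points / total_points


values = [0.0, 0.0, 5.0, 5.0, 5.0, 5.0]
fixed_idx = [0, 3]   # fixed interval: every third point
sparse_idx = [0, 2]  # delta-based: the point where the value jumps
for name, idx in (("fixed", fixed_idx), ("sparse", sparse_idx)):
    est = hold_reconstruct(values, idx)
    print(name, mse(values, est), reduction(len(values), len(idx)))
```

In this toy series both approaches keep two of six points, yet only the fixed-interval selection misses the jump, which mirrors the qualitative comparison described above.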

With reference to FIG. 8, the graph 802 represents a further example time series (synthetically generated using a random walk) generated at 1-minute intervals. Here, a much lower variance between the results of the original and sparseness approaches may be observed. There are very few outliers. A visual comparison of the respective MSEs in chart 806 and the DPM results shown in chart 804 indicates that data fidelity was preserved (i.e., lower MSE values) while yielding an improved reduction in the level of data points used.

A more extreme case is shown in the graph 902 of FIG. 9, in which there is very little variance and in which sparse algorithms yield their greatest effect, both in preserving data fidelity and in reducing data volume. Significant improvements in both aspects are achieved, as shown in chart 904 and chart 906. Such low variance series are common in ephemeral or high dimensional data sets.

As mentioned above, most metric data is generated at 1 minute granularity, but fidelity can be lost through aggregation at 10 minutes. For some example system counters, the aggregation step is no different than the sparseness algorithm proposed. In a 10 minute interval, the first 9 points may be dropped with only the last point let through. In sparseness examples, significant increments in count can pass through, and this may happen every minute or only twice a day, for example. So, some sparseness examples can provide better fidelity when the system counter is changing rapidly.

Transformations can have an impact in some examples. Mathematically speaking, sparseness introduces an error (ε) in the value at time t. Assume that for a particular metric (M) we receive a value mt0 at time t0 and a value mt1 at time t1. A sparseness algorithm (see below for examples) analyzes the value mt1 received at t1 and may drop it if the change is not significant. In other words, an estimate of the value at time t1 is mt0+ε, where the assumption is that ε is capped depending on the sparse algorithm and its configuration. When the value is directly visualized, the assumption is that the error ε introduced is insignificant and does not impact an ability to troubleshoot or detect problems. However, if this value is not directly visualized, but another transformation f( ) has to be applied, then unexpected behavior may arise and be detectable, since the value visualized is not mt0+ε but f(mt0+ε)=f(mt0)+E. In this case, the error E introduced can be arbitrary and may not be tolerable.

As an illustration, if f( ) is an antilog, the error E can be extremely large. Another example is if M is a counter and f( ) is a derivative, or f( ) can be a difference between two counters. As a general rule, some examples exclude all metrics that are transformed before visualizing from sparseness algorithms. However, this may exclude a vast chunk of potentially available metrics. Instead of excluding these, some examples work on providing these transformations as part of a write path, so that examples only write the transformed value (end user consumable), and this can be safely converted to a sparse form.
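The amplification of a small sparseness error by a later transformation may be illustrated with a short sketch. The helper names and the sample values (a stored value of 6.0 and an error of 0.05) are hypothetical and chosen only for the illustration. For a smooth f( ), the transformed error is to first order the derivative of f( ) at the stored value multiplied by the original error, which is why a steep function such as a base-10 antilog magnifies it.

```python
def transformed_error(f, last_value, epsilon):
    """Error E seen after applying f when the true value is
    last_value + epsilon but the stored last_value is used instead."""
    return f(last_value + epsilon) - f(last_value)

def identity(x):
    """Direct visualization: no transformation applied."""
    return x

def antilog10(x):
    """Base-10 antilog, used here as an example of a steep transformation."""
    return 10.0 ** x

if __name__ == "__main__":
    for name, f in (("identity", identity), ("antilog10", antilog10)):
        print(name, transformed_error(f, 6.0, 0.05))
```

With the identity transformation the error stays at roughly 0.05, whereas with the antilog the same error grows to more than 100,000.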

In some examples, the transformations may be applied to counter types, mainly because a counter by itself may be hard to interpret and only carries value when transformed through a derivative to compute the rate of change or as a difference between counters to estimate queue lengths. Such transformations are generally of relatively low cost and convenient to perform at write time. Example transformations may include a diff series transformation in which counters are used to compute a queue length using DIFFSERIES. Another example includes a derivative transformation in which a non-negative-derivative is more stable with lower variance between nodes.
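A minimal sketch of two such write time transformations follows. It is an assumption-laden illustration rather than the implementation used in the examples: the function names, the (epoch, value) tuple layout, and the choice to skip samples where a counter decreases (treated here as a counter reset) are all hypothetical.

```python
def non_negative_derivative(points):
    """Convert (epoch_seconds, counter_value) samples into per-second rates.
    Samples where the counter went backwards are skipped (assumed reset)."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t1 <= t0:
            continue  # ignore out-of-order or duplicate timestamps
        delta = v1 - v0
        if delta < 0:
            continue  # assumed counter reset
        rates.append((t1, delta / (t1 - t0)))
    return rates

def diff_series(enqueued, dequeued):
    """Estimate a queue length as the difference of two counters that
    were sampled at the same epochs."""
    return [(t, a - b) for (t, a), (_, b) in zip(enqueued, dequeued)]

if __name__ == "__main__":
    counter = [(0, 0), (60, 120), (120, 60), (180, 180)]
    print(non_negative_derivative(counter))
    print(diff_series([(0, 10), (60, 25)], [(0, 7), (60, 20)]))
```

The transformed series, rather than the raw counters, would then be the input to a sparseness check.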

Some sparseness examples are configured for types of metric data. For example, a gauge is an instantaneous measurement of a value, such as a CPU utilization. Many gauges are sampled at a certain frequency (e.g., CPU utilization is captured every minute), and thus can conveniently be the subject of a sparseness algorithm without losing critical information.

In another example metric type, almost all counters are monotonically increasing counters. One sparseness example may seek a more efficient way of measuring pending jobs in a queue. Here, a sparseness algorithm is not applied directly to counter values, but is instead applied after a pre-transformation. For example, a sparseness algorithm may be applied after transforming the counter through a derivative (essentially converting it to a meter) or any other possible write time transformation. For the counters that remain, examples either leave them untouched or apply a derivative based sparseness check.

In another example metric type, a meter measures the rate of events over time (e.g., “requests per second”). In addition to the mean rate, meters also track 1-, 5-, and 15-minute moving averages. Sparseness examples may only preserve the lowest granularity rate and drop the higher granularity rates since they can be estimated from the lower granularity rate. Meters are usually end user consumable and, in some examples, sparseness algorithms are applied by default to meters.

In another example metric type, a histogram measures a statistical distribution of values in a stream of data. In addition to minimum, maximum, and mean values (for example), a histogram may also measure median, 75th, 90th, 95th, 98th, 99th, and 99.9th percentiles. Histograms are by definition approximate data structures, so it is safe to subject them to sparseness algorithms and, in some examples, these are applied by default. Regardless of type, some examples may include one or more specified exclusion rules to skip sparseness algorithms for specific metrics.

Broadly, sparseness algorithms may be used in two ways to create sparse data: at a per series level and across series. In series level sparseness, examples consider data points reported in a single time series and drop points that do not show a significant change compared to the last known value. The two graphs in FIG. 10 illustrate this; they represent the same time series. The top graph 1002 (darker background) represents a traditional time series in which a data point was reported at a fixed interval. The graph 1004 (lighter background) represents the same series as a sparse series in which data points are only reported when there is a significant change from the last known value. The graph 1004 includes about 50% fewer data points reported, but still preserves significant information.

In a multi-series approach, some examples are very aggressive in dropping data points and may be introduced by opt-in only, for example where metrics can tolerate large drops. This approach is also more relevant to high dimensional metrics where the cardinality can be extremely large and dense metrics can become prohibitively costly to store. This approach also closely fits with the general theme of present examples, namely to store outliers precisely while only keeping normal bands of values. Some multi-series sparseness examples learn or exhibit a significance across multiple dimensions of the same metric. Take the CPU example again: a change from 2% to 5% is quite significant when looked at from the perspective of a single time series; however, when observed over the node dimension grouped by version across 30K time series, the standard deviation may itself be 5%, in which case the 3% delta is well within the normal range and not significant. Further, in multi-series sparseness, baseline and normal expected ranges may be determined across dimensional space, and examples drop all points of an individual series that lie within an expected range, while preserving the points that fall outside of the range.

The respective examples in FIG. 11 and FIG. 12 illustrate this concept.An output 1102 of a single-series approach is shown in FIG. 11 .

After a multi-series sparseness approach of FIG. 12, most of the series is represented by the expected band 1202, except where an outlier 1204 is shown. As a more practical example, a CPU utilization across 30,000 customer deployments can be represented fairly accurately by approximately 10 bands 1202. Effectively, that translates to a reduction of approximately 3000 times in the data points needed to be stored.
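The band-and-outlier idea may be sketched, under stated assumptions, as follows. The function name, the use of a population mean and standard deviation to define the expected range, and the width multiplier `k` are hypothetical choices for the illustration; the examples do not specify how the expected range is computed.

```python
from statistics import mean, pstdev

def multi_series_sparsify(snapshot, k=1.0):
    """snapshot maps a series id to its value at one timestamp.
    Returns the expected band (low, high) and the outlier points that
    fall outside it; in-band points are represented by the band alone."""
    values = list(snapshot.values())
    center = mean(values)
    spread = pstdev(values)
    low, high = center - k * spread, center + k * spread
    outliers = {sid: v for sid, v in snapshot.items() if not (low <= v <= high)}
    return (low, high), outliers

if __name__ == "__main__":
    cpu_by_node = {"n1": 2.0, "n2": 3.0, "n3": 4.0, "n4": 3.0, "n5": 40.0}
    band, outliers = multi_series_sparsify(cpu_by_node)
    print(band, outliers)
```

In this sketch only the band and the single out-of-band node need to be stored for the timestamp, instead of one point per node.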

The table 1302 shown in FIG. 13 indicates, for a given metric type, a recommended sparseness algorithm and an observed data drop rate. Any missing data in some examples is represented by an explicit tombstone. If a system generates sparse data, then a client may be required to send an explicit tombstone to mark the end of a data stream. On the server side, assuming the client is not sending sparse data, the server maintains the last epoch of a time series, the last wall clock time, and the expected update interval (10 minutes or 1 minute), and generates a tombstone when the reported data point misses an update.
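A server-side tombstone check of the kind described may be sketched as below. The `SeriesState` record, the `find_overdue_series` helper and the optional grace period are hypothetical names and parameters introduced for the illustration; only the three tracked quantities (last epoch, last wall clock time, expected update interval) come from the description above.

```python
from dataclasses import dataclass

@dataclass
class SeriesState:
    last_epoch: int         # epoch carried by the last reported data point
    last_wall_clock: int    # server wall clock time when it arrived (seconds)
    expected_interval: int  # expected update interval in seconds (60 or 600)

def find_overdue_series(states, now, grace=0):
    """Return ids of series that have missed an expected update and
    for which a tombstone would therefore be generated."""
    return [sid for sid, s in states.items()
            if now - s.last_wall_clock > s.expected_interval + grace]

if __name__ == "__main__":
    states = {
        "cpu.node1": SeriesState(last_epoch=940, last_wall_clock=1000, expected_interval=60),
        "disk.node1": SeriesState(last_epoch=940, last_wall_clock=1000, expected_interval=600),
    }
    print(find_overdue_series(states, now=1100))
```

Here the 1 minute series is overdue after 100 seconds of silence, while the 10 minute series is not.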

Examples described thus far teach methods of deriving sparse data to facilitate telemetry applications. A question now arises as to how to index that data and allow fast query performance. Some time series databases index data points based on a series identifier (metric name+metadata/dimensions). However, the occurrence of significant indexing issues is a distinct possibility when telemetry systems wish to support ephemeral series (for example, containing ephemeral universal unique identifiers (UUIDs) such as a Java Fuzzy Logic (JFL) job UUID) and high cardinality series (for example, containing metadata or dimensions with a large range of values). To address these and other issues, FIG. 14 illustrates key components in an anomaly database system 1402.

A general architecture of an anomaly database system 1402 is shown in FIG. 14. The components in the anomaly database system 1402 include a CDM node 1404. In some examples, the CDM node 1404 includes a metrics library (described further below), which sends out telemetry data, in particular metrics thereof, in a sparse manner. Some examples use simple sparseness detection algorithms such as diff-value (emit only if different from the previous datapoint) and last-value-delta (emit only if the delta of the previous value to the current value is above a threshold). The metrics library stores these metrics locally and sends a copy to the cloud (e.g., by statistics relay 1406) so that the data can be stored permanently. The local storage may only retain metrics for the last 7 days, or another period, for example. The statistics relay 1406 is mainly responsible for receiving streaming metrics from a plurality of CDM nodes, of which the CDM node 1404 may form part. The statistics relay 1406 performs some basic blacklisting and whitelisting to cut down the incoming metrics, and pushes all metrics to a metrics collector such as Kafka 1408.

In some examples, the metrics library allows new metrics to be created using a tag based schema that the anomaly database system 1402 may require. In order to support legacy metrics, examples may include a Telegraph plugin (telegraph legacy metric migration 1410) that can convert a flat format heuristically to a tag based format. The anomaly database system 1402 also includes a sparse consumers module 1412.

A detailed view showing aspects of an instance of the sparse consumers module 1412 is shown in FIG. 15. The sparse consumers module 1412 pulls metrics that were pushed by statistics relay 1406 from Kafka 1408. It will be noted with reference to FIG. 14 that metrics data on the “upstream” side of the sparse consumers module 1412 is dense (e.g., 1B DPM), whereas metrics data on the “downstream” side of the sparse consumers module 1412 is sparse (e.g., 100M DPM).

FIG. 15 shows a more detailed view of components of an example sparse consumers module 1412. FIG. 15 includes an example write path of a sparse consumers module 1412. As mentioned above, metrics data on the “upstream” side of the sparse consumers module 1412 is dense (e.g., 1B DPM), whereas metrics data on the “downstream” side of the sparse consumers module 1412 is sparse (e.g., 100M DPM). In this regard, example sparse consumers modules 1412 run algorithms, such as sparse algorithm 1502, to cut down on the number of data points. The example algorithms listed below may be supported.

In the following example sparse algorithms, current_value refers to the value received, and last_value refers to the last value written. Thus: (1) in a diff-value algorithm (lossless), DROP IF current_value=last_value; (2) in a last-value-delta algorithm (useful for all metric types other than counter), DROP IF (current_value−last_value)/last_value<threshold; (3) in a derivative-delta algorithm (useful for counters), with a current_derivative=(current_value−last_value)/(current_epoch−last_epoch) and a last_derivative=[previous current_derivative written], DROP IF current_derivative/last_derivative<threshold; (4) in a stddev band algorithm, DROP IF moving_avg−stddev<current_value<moving_avg+stddev; (5) in a stddev band with last-value fallback algorithm, DROP IF #4, else DROP IF #2; and (6) in a last-value-delta with percentile algorithm, DROP IF (current_value−last_value)/95thPercentile_value<threshold, where 95thPercentile_value is computed over a moving window.
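Three of these drop rules may be sketched with a small per-series filter that remembers the last value written and a short recent history. This is an illustrative sketch only: the class and function names, the window length of 10 and the use of an absolute relative change in the last-value-delta rule are assumptions, and the derivative-delta, fallback and percentile variants are omitted.

```python
from collections import defaultdict, deque
from statistics import mean, pstdev

def diff_value(current, last_written, history):
    """Lossless rule: drop only exact repeats of the last written value."""
    return current == last_written

def last_value_delta(threshold):
    """Rule factory: drop if the relative change is below `threshold`."""
    def should_drop(current, last_written, history):
        if last_written == 0:
            return current == 0
        return abs(current - last_written) / abs(last_written) < threshold
    return should_drop

def stddev_band(current, last_written, history):
    """Drop if the value lies strictly inside moving_avg +/- stddev."""
    if len(history) < 2:
        return False  # not enough history to form a band
    avg, sd = mean(history), pstdev(history)
    return avg - sd < current < avg + sd

class SparseFilter:
    """Applies one drop rule per series, keeping the needed state in memory."""
    def __init__(self, should_drop, window=10):
        self.should_drop = should_drop
        self.last_written = {}
        self.history = defaultdict(lambda: deque(maxlen=window))

    def offer(self, series_id, value):
        """Return True if the point should be written, False if dropped."""
        history = self.history[series_id]
        first = series_id not in self.last_written
        drop = (not first) and self.should_drop(
            value, self.last_written[series_id], history)
        history.append(value)
        if not drop:
            self.last_written[series_id] = value
        return not drop
```

A consumer would call `offer` for each incoming point and forward only those for which it returns True; the first point of every series is always written.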

Of all these algorithms, the last-value-delta has the least cost and gives reasonable results. The other algorithms provide better drop rates but at a higher computation and memory cost. All the algorithms above require storage either of the last value or recent history for each series. This may be stored in a stats cache that lives in memory. In some examples, a single metric consumer will see all the metrics from a particular node; thus a fairly high cache hit ratio may be expected, with very few duplicate entries across metric consumers. In some examples, a user can override default behavior by editing a schema.yaml file. In this way, metrics that should be maintained in original fidelity can be preserved.

Some sparseness examples may be bounded. The sparseness algorithms described so far publish a point only if it has changed significantly. For illustration, consider a time series that is always reporting a value of 1 every 10 minutes for the whole year. The sparseness algorithm will drop all points for this series except the very first point reported at the beginning.

Sparseness can be very helpful from a storage perspective. However, in some cases, some examples may hide sparseness from the end user, so that user queries can be simplified and do not have to account for missing data. This translation of sparse to dense data creates problems for the read path and for data purge. In the read path, examples need to know the last value to compute or report results in a specific time range. In a multi-dimensional metric, if some series are very sparse like the example above, the read path can be forced to read all segments for the entire year to generate results for just the last hour. In purge, purging data older than the configured retention period may be key to containing data storage costs. However, for unbounded sparseness, a segment cannot be dropped immediately because the oldest data might be needed to do a sparse-to-dense conversion. For example, consider a time series that generates a value of 1 every 10 minutes for a whole year and has a retention of 1 year. Say we stored a value for 1 Jan. 2019 00:00:00 and no data points thereafter since they were similar. On 1 Jan. 2020, that first data point cannot be deleted because it is needed for dense conversion.

Examples address these problems by introducing bounded sparseness, where examples add another condition that two consecutive sparse points must not be more than a half of a segment interval apart. This ensures that each segment contains the last valid value of each series. This allows the read path to limit its queries to relevant segments only and allows an applicable purge policy to drop segments based on their age without having to scan their contents, because any value in them is safely replicated in future segments as well. This may come at some additional cost of duplicating values that otherwise would have been dropped. However, the simplification of purge and read path may justify the additional cost.
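Bounded sparseness may be sketched by adding a maximum-gap condition to a last-value-delta check. The function name, the (epoch, value) tuple layout and the `report_interval` parameter are assumptions for the illustration; the sketch re-emits a point whenever waiting for the next report would leave two written points more than half a segment interval apart.

```python
def bounded_sparsify(points, threshold, segment_interval, report_interval):
    """points: (epoch_seconds, value) pairs sorted by epoch and reported
    every `report_interval` seconds. Returns the points to be written."""
    max_gap = segment_interval / 2
    kept = []
    for epoch, value in points:
        if not kept:
            kept.append((epoch, value))  # the first point is always written
            continue
        last_epoch, last_value = kept[-1]
        if last_value == 0:
            changed = value != 0
        else:
            changed = abs(value - last_value) / abs(last_value) >= threshold
        # Waiting for the next report would push the gap past the bound.
        overdue = (epoch - last_epoch) + report_interval > max_gap
        if changed or overdue:
            kept.append((epoch, value))
    return kept

if __name__ == "__main__":
    # A constant series reported every 10 minutes for 4 hours.
    constant = [(t, 1.0) for t in range(0, 14400, 600)]
    print(bounded_sparsify(constant, 0.01, segment_interval=3600, report_interval=600))
```

For the constant series, an unbounded check would write only the first point, whereas the bounded variant writes a duplicate every 30 minutes when the segment interval is 1 hour.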

Some examples of an anomaly database system 1402 may include a tokenizer 1504. A tokenizer maps verbose, human friendly metadata names to token values in an integer namespace. Based on data from an example node, a tokenizer can reduce metadata size by 85%. Note that this is different than compression because, unlike with a compressed form, all of the sparseness algorithms can still operate on the tokenized data points. This includes Druid and a read API (for example, read API 1416 in FIG. 14). In some examples, the de-tokenization occurs in the read API 1416 just before results are returned to a user. A read API 1416 cache may also store results in tokenized format.

In some examples, a token map can be built offline to generate an immutable token map that can be loaded directly in memory. This removes the need for any synchronization between containers and will not have any negative impact on throughput. An offline token building process allows examples to perform global optimizations in assigning smaller token values to high frequency words. Tokenization may not be applicable to ephemeral tag values, mainly because it will significantly increase the number of tokens and examples may lose the compression factor obtained from tokenization.
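An offline token map of this kind may be sketched as follows. The function names, the tie-breaking by alphabetical order and the use of a read-only mapping view to model immutability are assumptions made for the illustration.

```python
from collections import Counter
from types import MappingProxyType

def build_token_map(corpus):
    """Build a read-only word -> integer map offline, giving the smallest
    token values to the most frequent words (ties broken alphabetically)."""
    counts = Counter(corpus)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return MappingProxyType({word: token for token, word in enumerate(ordered)})

def tokenize(tags, token_map):
    """Replace tag names and tag values with their integer tokens."""
    return {token_map[name]: token_map[value] for name, value in tags.items()}

def detokenize(tokens, token_map):
    """Inverse of tokenize, e.g. applied just before returning results."""
    words = {token: word for word, token in token_map.items()}
    return {words[name]: words[value] for name, value in tokens.items()}

if __name__ == "__main__":
    corpus = ["cluster", "prod", "node", "n1", "cluster", "prod", "node", "n2"]
    token_map = build_token_map(corpus)
    tokens = tokenize({"cluster": "prod", "node": "n1"}, token_map)
    print(tokens, detokenize(tokens, token_map))
```

Because the map is built once and never mutated, each container can load the same copy without coordinating with the others.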

As mentioned above, the anomaly database system 1402 includes a rollup module 1414 in some examples. A detailed example is shown in FIG. 16. Performing rollups conveniently enables read queries over very large time ranges. In the absence of rollups, such queries can request large volumes of data from the backend. With rollups, examples can reduce the volume of data points requested without loss of efficacy. In some examples, a rollup module 1414 is implemented in a metric consumer and operates on sparse data to compute a time weighted mean over a rollup window. Optimal rollup intervals can change dramatically based on query patterns, and the exact rollup window may be fine-tuned based on usage data. As noted in FIG. 16, insignificant data points can be dropped by as much as 90%.
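A time weighted mean over sparse data may be sketched as below, assuming that each sparse point holds its value until the next point arrives. The function name and the requirement that a point exists at or before the start of the window are assumptions made for the illustration.

```python
def time_weighted_mean(points, window_start, window_end):
    """points: sparse (epoch, value) pairs sorted by epoch, where each
    value is assumed to hold until the next point. Returns the mean of
    the series over [window_start, window_end), weighted by duration."""
    total = 0.0
    current = None
    prev_t = window_start
    for epoch, value in points:
        if epoch <= window_start:
            current = value  # value already in effect when the window opens
            continue
        if epoch >= window_end:
            break
        if current is None:
            raise ValueError("no point at or before window_start")
        total += current * (epoch - prev_t)
        prev_t, current = epoch, value
    if current is None:
        raise ValueError("no point at or before window_start")
    total += current * (window_end - prev_t)
    return total / (window_end - window_start)

if __name__ == "__main__":
    sparse = [(0, 10.0), (300, 20.0), (540, 50.0)]
    print(time_weighted_mean(sparse, 0, 600))  # 18.0
```

In this usage example the plain average of the three stored values would be about 26.7, whereas weighting by how long each value was in effect (300, 240 and 60 seconds) gives 18.0, which is why a sparse series calls for a time weighted rather than a simple mean.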

The anomaly database system 1402 may also include a baseline estimator 1418. The goal of a baseline estimator 1418 is to pre-compute baselines on streaming data to enable anomaly detection, correlations and multi-series sparseness. The baseline estimator 1418 may communicate with a time-indexed cache 1426, such as Redis. Redis is an open source (BSD licensed), in-memory data structure store, used as a database, cache and message broker. It supports data structures such as strings, hashes, lists, sets, sorted sets with range queries, bitmaps, hyperloglogs, geospatial indexes with radius queries and streams.

The anomaly database system 1402 may also include, in a backend, a real-time analytics database 1420, such as Apache Druid. Apache Druid is a database that is most often used for powering use cases where real-time ingest, fast query performance, and high uptime are important. As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations. The real-time analytics database 1420 may include an index and value function. For example, the real-time analytics database 1420 may provide an index along with the value store, which makes it easier to build a time series database (TSDB) without requiring a second indexing solution. The real-time analytics database 1420 may include a time based index in which segments are partitioned by time. This also implies that the real-time analytics database 1420 can support ephemeral tag values natively since time-based indexing prevents an error.

The real-time analytics database 1420 may further include a co-located dimensional space. Thus, some examples include a data model that stores all dimensional series of a metric in the same data source. In some examples, all the dimensional data for a particular metric is stored in the same segment. A common read access pattern is to provide an aggregate over the dimensional space of a particular metric. This colocation of data allows an amortization of total cost. To illustrate, imagine a metric, CPU utilization, for which the dimensional space is cores × nodes. If each node has 10 cores and we have 30K nodes, then we will have 300K time series. A read query may want to plot the average CPU usage across all cores and nodes; such a query will need to read all 300K time series in order to compute the aggregated value. If this data is modelled in Cassandra, a typical data model would be to have a partition key per series. This implies we will have 300K keys that will need a lookup. This data can get scattered over many SSTables, and our aggregate query may be required to load many SSTables, where the actual useful data per SSTable is very small. In present examples, an anomaly database system 1402 that includes Druid (for example) can model this such that the CPU usage across all dimensions (nodes, cores) is stored in a single data source, which means that all 300K series will be packed into a single segment. So our example read query will result in loading a single segment, and most of the data in the segment will be useful.

In some examples, an anomaly database system 1402 includes data models. An example schema 1702 is shown in FIG. 17. A Druid data model, for example, starts with a datasource which has dimensions and metrics. An example anomaly database system 1402 supports an Influx schema. The schema 1702 may include measurement, field and tags.

Another example schema 1802 is shown in FIG. 18. This tagged schema example includes measurement (for example, diamond.process), tags (for example, cluster and node), and fields (for example, CPU and disk). This data model encourages collocating all series associated with a particular field into a single segment. This also encourages collocating all fields that are likely to be displayed or accessed together based on the common measurement.

A Graphite flat schema example is shown at schema 1902 in FIG. 19.

As mentioned above, an example anomaly database system 1402 includes a read API 1416. FIG. 20 illustrates the major components of the read API 1416. In some examples, the read API 1416 provides an interface, such as an Influx query language (QL) interface (see schema 1702 above), to metrics stored in the anomaly database system 1402. Some examples may include a custom implementation instead of directly using a Druid API for the following reasons: the need for sparse data (a Druid API, for example, has no notion of sparse data, so any aggregation will not generate correct results); cross data source aggregation (Druid, for example, does not support cross data source aggregation); and support for certain Influx QL and custom functions.

In some examples, the read API 1416 includes its own query cache 2002, despite the fact that Druid (for example) also provides caching capabilities. The main motivator for a local query cache 2002 is that the anomaly database system 1402 may perform significant aggregations in the read API 1416, and the cost of doing that can be amortized by caching at the top level. Another motivator is the type of cache that an anomaly database system 1402 may use: the common case in dashboards is to repeat the same query with the same interval at a continuous refresh rate. Here is an example query from Grafana set to auto-refresh with the last 1 hour window: SELECT mean(“diamond.user_percent”) FROM “tsdb” WHERE time>=now( )−1h AND cluster=‘f5888c22-9651-4cd0-8e3a-90367d9242c71’ AND node=‘RVHMZ321A719’ GROUP BY time(10s). The above query is repeated every 10s, generating a moving window of data points. Since the query has a relative time definition, examples take care in using simple caches, which would generate a hit against the query and return stale data. An example read API 1416 parser converts the relative time to absolute time, and then, from Druid's perspective, such queries are new queries every time and may not generate a cache hit. The query cache 2002 therefore uses the query without the time range as a key, and caches the results by time ranges. The query cache 2002 can thus give partial hits for the part of the time range that is in the cache. In some examples, the cached results are stored in memory of the read API task and are also backed up in the time-indexed cache 1426 (e.g., Redis). The time-indexed cache 1426 (Redis) allows examples to leverage a larger RAM and collect cache results from multiple read API nodes. The v0 cached results are not stored compressed; examples may enable compression based on the cost of compression versus the benefit of more cache space.
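A cache keyed by the query text without its time range, with results stored per time bucket, may be sketched as follows. The class name, the bucket layout and the assumption that the requested range is aligned to the bucket step are hypothetical and are used only to illustrate how partial hits can arise.

```python
class QueryCache:
    """Caches results per time bucket under a time-independent query key."""

    def __init__(self):
        self._buckets = {}  # query_key -> {bucket_start: result}

    def put(self, query_key, results):
        """Store {bucket_start: result} entries for a query key."""
        self._buckets.setdefault(query_key, {}).update(results)

    def get(self, query_key, start, end, step):
        """Return (hits, missing) for the range [start, end), where hits maps
        cached bucket starts to results and missing is a list of contiguous
        (range_start, range_end) gaps still to be fetched from the backend."""
        cached = self._buckets.get(query_key, {})
        hits, missing = {}, []
        for t in range(start, end, step):
            if t in cached:
                hits[t] = cached[t]
            elif missing and missing[-1][1] == t:
                missing[-1] = (missing[-1][0], t + step)  # extend current gap
            else:
                missing.append((t, t + step))
        return hits, missing

if __name__ == "__main__":
    cache = QueryCache()
    key = "SELECT mean(user_percent) GROUP BY time(10s)"  # time range removed
    cache.put(key, {0: 1.0, 10: 2.0, 20: 3.0})
    print(cache.get(key, 10, 50, 10))
```

When the moving window advances, only the newest buckets are missing, so the backend is asked for a short range rather than the whole hour.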

An example anomaly database system 1402 may include a user interface 1422. An example user interface 1422 may include a Grafana dashboarding user interface (UI) that can be used for visualizing metrics from the anomaly database system 1402.

With reference back to FIG. 14, an example anomaly database system 1402 may also include an alerts framework 1424. An example alerts framework 1424 may query a TSDB with two kinds of queries: baseline queries and check queries. Both kinds of queries typically operate over the entire dimensional space. Baseline queries are typically executed over a 10 day period at 10 minute granularity once a day. Check queries are typically executed over the last hour at 10 minute granularity every 10 minutes. Baseline queries are extremely expensive because they request a very large volume of data, but these are infrequent queries (1/day per alert). Check queries are inexpensive because they are scoped to the most recent data, but these are very frequent (1/10 minutes per alert). Some example anomaly database systems 1402 implement baseline algorithms and anomaly detection algorithms natively in the read API 1416, where these algorithms are able to operate on sparse data directly. This reduces the overall read workload from the alerts framework 1424 by the same order of magnitude that sparseness reduces the write workload. Additionally, this provides an added benefit that developers can query for anomalous events directly from the anomaly database system 1402 instead of relying on an alert framework which files Jira tickets.

In some example anomaly database systems 1402, access is restricted by security groups that only allow machines within the anomaly database system 1402 to communicate among each other. Only the read API 1416 is exposed outside of the anomaly database system 1402 security group through an endpoint (for example, https://anomalydb.rubrik.com). This endpoint may also be restricted to product security groups and VPN.

In some examples, the anomaly database system 1402 is a complex system containing many parts, and each of these needs to be carefully monitored. In some instances, an anomaly database system 1402 should not store its own telemetry, to avoid situations where the anomaly database system 1402 is misbehaving and an operator has no visibility into that because system metrics were lost. For this reason, in some examples, all telemetry generated by an anomaly database system 1402 itself is stored in an external database, such as MetricTank and Amazon Web Services (AWS) CloudWatch. Dashboards such as Grafana Dashboard and CloudWatch Dashboard may track overall activity.

With reference to FIG. 21, certain operations in an example computer-implemented method 2100 at a networked computing system are provided. An example method 2100 processes metrics in telemetry data in an anomaly database system comprising a CDM node. An example method 2100 comprises: at operation 2102, receiving, by a statistics relay, streaming metrics from nodes in a node cluster, the node cluster including the CDM node, the statistics relay pushing the received metrics to a metrics collector; and, at operation 2104, pulling metrics, by a sparse consumers module, from the metrics collector.

In some examples, the operations further comprise running a sparse algorithm on the pulled metrics to reduce a number of data points. In some examples, the sparse algorithm is selected from a group of sparse algorithms comprising: a diff-value algorithm, a last-value-delta algorithm, a standard deviation band algorithm, a standard deviation band algorithm with a last-value fallback, and a last-value-delta with percentile algorithm. In some examples, values generated by the sparse algorithm are bounded and assigned a publication status based on falling within a bounded value.

In some examples, the operations further comprise enabling read queries, by a rollup module, over a designated time range.

In some examples, the operations further comprise pre-computing baselines on the streaming metrics to enable anomaly detection, correlations and multi-series sparseness.

In some examples, a non-transitory machine-readable medium includes instructions which, when read by a machine (apparatus), cause the machine to perform operations in a method of processing metrics in telemetry data in an anomaly database system comprising a CDM node. Example operations may include the operations summarized above.

FIG. 22 is a block diagram illustrating an example of a computer software architecture for data classification and information security that may be installed on a machine, according to some example embodiments. FIG. 22 is merely a non-limiting example of a software architecture 2202, and it will be appreciated that many other architectures may be implemented to facilitate the functionality described herein. The software architecture 2202 may be executing on hardware such as a machine 2400 of FIG. 24 that includes, among other things, processor 2346, memory 2348, and I/O components 2350. A representative hardware layer 2204 of FIG. 22 is illustrated and can represent, for example, the machine 2400 of FIG. 24. The representative hardware layer 2204 of FIG. 22 comprises one or more processing units 2206 having associated executable instructions 2208. The executable instructions 2208 represent the executable instructions of the software architecture 2202, including implementation of the methods, modules, and so forth described herein. The representative hardware layer 2204 also includes memory or storage modules 2210, which also have the executable instructions 2208. The representative hardware layer 2204 may also comprise other hardware 2212, which represents any other hardware of the representative hardware layer 2204, such as the other hardware illustrated as part of the machine 220.

In the example architecture of FIG. 22, the software architecture 2202 may be conceptualized as a stack of layers, where each layer provides particular functionality. For example, the software architecture 2202 may include layers such as an operating system 2214, libraries 2218, frameworks and/or middleware 2216, applications 2220, and a presentation layer 2242. Operationally, the applications 2220 or other components within the layers may invoke API calls 2222 through the software stack and receive a response, returned values, and so forth (illustrated as messages 2224) in response to the API calls 2222. The layers illustrated are representative in nature, and not all software architectures have all layers. For example, some mobile or special purpose operating systems may not provide a frameworks and/or middleware 2216 layer, while others may provide such a layer. Other software architectures may include additional or different layers.

The operating system 2214 may manage hardware resources and provide common services. The operating system 2214 may include, for example, a kernel 2228, services 2226, and drivers 2230. The kernel 2228 may act as an abstraction layer between the hardware and the other software layers. For example, the kernel 2228 may be responsible for memory management, processor management (e.g., scheduling), component management, networking, security settings, and so on. The services 2226 may provide other common services for the other software layers. The drivers 2230 may be responsible for controlling or interfacing with the underlying hardware. For instance, the drivers 2230 may include display drivers, camera drivers, Bluetooth® drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), Wi-Fi® drivers, audio drivers, power management drivers, and so forth depending on the hardware configuration.

The libraries 2218 may provide a common infrastructure that may be utilized by the applications 2220 and/or other components and/or layers. The libraries 2218 typically provide functionality that allows other software modules to perform tasks in an easier fashion than by interfacing directly with the underlying operating system 2214 functionality (e.g., kernel 2228, services 2226, or drivers 2230). The libraries 2218 may include system libraries 2232 (e.g., C standard library) that may provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like. In addition, the libraries 2218 may include API libraries 2234 such as media libraries (e.g., libraries to support presentation and manipulation of various media formats such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG), graphics libraries (e.g., an OpenGL framework that may be used to render 2D and 3D graphic content on a display), database libraries (e.g., SQLite that may provide various relational database functions), web libraries (e.g., WebKit that may provide web browsing functionality), and the like. The libraries 2218 may also include a wide variety of other libraries 2236 to provide many other APIs to the applications 2220 and other software components/modules.

The frameworks (also sometimes referred to as middleware) may provide a higher-level common infrastructure that may be utilized by the applications 2220 or other software components/modules. For example, the frameworks and/or middleware 2216 may provide various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth. The frameworks and/or middleware 2216 may provide a broad spectrum of other APIs that may be utilized by the applications 2220 and/or other software components/modules, some of which may be specific to a particular operating system or platform.

The applications 2220 include built-in applications 2238 and/or third-party applications 2240. Examples of representative built-in applications 2238 may include, but are not limited to, a home application, a contacts application, a browser application, a book reader application, a location application, a media application, a messaging application, or a game application.

The third-party applications 2240 may include any of the built-in applications 2238, as well as a broad assortment of other applications. In a specific example, the third-party applications 2240 (e.g., an application developed using the Android™ or iOS™ software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as iOS™, Android™, Windows® Phone, or other mobile operating systems. In this example, the third-party applications 2240 may invoke the API calls 2222 provided by the mobile operating system such as the operating system 2214 to facilitate functionality described herein.

The applications 2220 may utilize built-in operating system functions (e.g., kernel 2228, services 2226, or drivers 2230), libraries (e.g., system libraries 2232, API libraries 2234, and other libraries 2236), or frameworks and/or middleware 2216 to create user interfaces to interact with users of the system. Alternatively, or additionally, in some systems, interactions with a user may occur through a presentation layer 2250, such as the presentation layer 2242. In these systems, the application/module “logic” can be separated from the aspects of the application/module that interact with the user.

Some software architectures utilize virtual machines. In the example of FIG. 22, this is illustrated by a virtual machine 2246. A virtual machine creates a software environment where applications/modules can execute as if they were executing on a hardware machine (e.g., the machine 2400 of FIG. 24). A virtual machine 2246 is hosted by a host operating system (e.g., operating system 2214) and typically, although not always, has a virtual machine monitor 2244, which manages the operation of the virtual machine 2246 as well as the interface with the host operating system (e.g., operating system 2214). A software architecture executes within the virtual machine 2246, such as an operating system 2248, libraries 2256, frameworks/middleware 2254, applications 2252, or a presentation layer 2242. These layers of software architecture executing within the virtual machine 2246 can be the same as corresponding layers previously described or may be different.

FIG. 23 is a block diagram 2300 illustrating an architecture of software 2302, which can be installed on any one or more of the devices described above. FIG. 23 is merely a non-limiting example of a software architecture, and it will be appreciated that many other architectures can be implemented to facilitate the functionality described herein. In various embodiments, the software 2302 is implemented by hardware such as a machine 2400 of FIG. 24 that includes processor(s) 2346, memory 2348, and I/O components 2350. In this example architecture, the software 2302 can be conceptualized as a stack of layers where each layer may provide a particular functionality. For example, the software 2302 includes layers such as an operating system 2304, libraries 2308, frameworks 2306, and applications 2310. Operationally, the applications 2310 invoke application programming interface (API) calls 2312 through the software stack and receive messages 2314 in response to the API calls 2312, consistent with some embodiments.

In various implementations, the operating system 2304 manages hardware resources and provides common services. The operating system 2304 includes, for example, a kernel 2316, services 2320, and drivers 2318. The kernel 2316 acts as an abstraction layer between the hardware and the other software layers, consistent with some embodiments. For example, the kernel 2316 provides memory management, processor management (e.g., scheduling), component management, networking, and security settings, among other functionality. The services 2320 can provide other common services for the other software layers. The drivers 2318 are responsible for controlling or interfacing with the underlying hardware, according to some embodiments. For instance, the drivers 2318 can include display drivers, camera drivers, BLUETOOTH® or BLUETOOTH® Low Energy drivers, flash memory drivers, serial communication drivers (e.g., Universal Serial Bus (USB) drivers), WI-FI® drivers, audio drivers, power management drivers, and so forth.

In some embodiments, the libraries 2308 provide a low-level common infrastructure utilized by the applications 2310. The libraries 2308 can include system libraries 2322 (e.g., C standard library) that can provide functions such as memory allocation functions, string manipulation functions, mathematic functions, and the like. In addition, the libraries 2308 can include API libraries 2324 such as media libraries (e.g., libraries to support presentation and manipulation of various media formats such as Moving Picture Experts Group-4 (MPEG4), Advanced Video Coding (H.264 or AVC), Moving Picture Experts Group Layer-3 (MP3), Advanced Audio Coding (AAC), Adaptive Multi-Rate (AMR) audio codec, Joint Photographic Experts Group (JPEG or JPG), or Portable Network Graphics (PNG)), graphics libraries (e.g., an OpenGL framework used to render graphic content in two dimensions (2D) and three dimensions (3D) on a display), database libraries (e.g., SQLite to provide various relational database functions), web libraries (e.g., WebKit to provide web browsing functionality), and the like. The libraries 2308 can also include a wide variety of other libraries 2326 to provide many other APIs to the applications 2310.

The frameworks 2306 provide a high-level common infrastructure that can be utilized by the applications 2310, according to some embodiments. For example, the frameworks 2306 provide various graphic user interface (GUI) functions, high-level resource management, high-level location services, and so forth. The frameworks 2306 can provide a broad spectrum of other APIs that can be utilized by the applications 2310, some of which may be specific to a particular operating system or platform.

In an example embodiment, the applications 2310 include a home application 2328, a contacts application 2330, a browser application 2332, a book reader application 2334, a location application 2336, a media application 2338, a messaging application 2340, a game application 2342, and a broad assortment of other applications, such as a third-party application 2344. According to some embodiments, the applications 2310 are programs that execute functions defined in the programs. Various programming languages can be employed to create one or more of the applications 2310, structured in a variety of manners, such as object-oriented programming languages (e.g., Objective-C, Java, or C++) or procedural programming languages (e.g., C or assembly language). In a specific example, the third-party application 2344 (e.g., an application developed using the ANDROID™ or IOS™ software development kit (SDK) by an entity other than the vendor of the particular platform) may be mobile software running on a mobile operating system such as IOS™, ANDROID™, WINDOWS® Phone, or another mobile operating system. In this example, the third-party application 2344 can invoke the API calls 2312 provided by the operating system 2304 to facilitate functionality described herein.

FIG. 24 illustrates a diagrammatic representation of a machine 2400 in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, according to an example embodiment. Specifically, FIG. 24 shows a diagrammatic representation of the machine 2400 in the example form of a computer system, within which instructions 2406 (e.g., software, a program, an application, an applet, an app, or other executable code) for causing the machine 2400 to perform any one or more of the methodologies discussed herein may be executed. Additionally, or alternatively, the instructions 2406 may implement the operations of the method shown in FIG. 21, or as elsewhere described herein.

The instructions 2406 transform the general, non-programmed machine 2400 into a particular machine 2400 programmed to carry out the described and illustrated functions in the manner described. In alternative embodiments, the machine 2400 operates as a standalone device or may be coupled (e.g., networked) to other machines. In a networked deployment, the machine 2400 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 2400 may comprise, but not be limited to, a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a PDA, an entertainment media system, a cellular telephone, a smart phone, a mobile device, a wearable device (e.g., a smart watch), a smart home device (e.g., a smart appliance), other smart devices, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 2406, sequentially or otherwise, that specify actions to be taken by the machine 2400. Further, while only a single machine 2400 is illustrated, the term “machine” shall also be taken to include a collection of machines 2400 that individually or jointly execute the instructions 2406 to perform any one or more of the methodologies discussed herein.

The machine 2400 may include processor(s) 2346, memory 2348, and I/O components 2350, which may be configured to communicate with each other such as via a bus 2402. In an example embodiment, the processor(s) 2346 (e.g., a Central Processing Unit (CPU), a Reduced Instruction Set Computing (RISC) processor, a Complex Instruction Set Computing (CISC) processor, a Graphics Processing Unit (GPU), a Digital Signal Processor (DSP), an ASIC, a Radio-Frequency Integrated Circuit (RFIC), another processor, or any suitable combination thereof) may include, for example, a processor 2404 and a processor 2408 that may execute the instructions 2406. The term “processor” is intended to include multi-core processors that may comprise two or more independent processors (sometimes referred to as “cores”) that may execute instructions contemporaneously. Although FIG. 24 shows multiple processor(s) 2346, the machine 2400 may include a single processor with a single core, a single processor with multiple cores (e.g., a multi-core processor), multiple processors with a single core, multiple processors with multiple cores, or any combination thereof.

The memory 2348 may include a main memory 2412, a static memory 2410, and a storage unit 2416, each accessible to the processor(s) 2346 such as via the bus 2402. The main memory 2412, the static memory 2410, and the storage unit 2416 store the instructions 2406 embodying any one or more of the methodologies or functions described herein. The instructions 2406 may also reside, completely or partially, within the main memory 2412, within the static memory 2410, within the storage unit 2416, within at least one of the processor(s) 2346 (e.g., within the processor's cache memory), or any suitable combination thereof, during execution thereof by the machine 2400.

The I/O components 2350 may include a wide variety of components to receive input, provide output, produce output, transmit information, exchange information, capture measurements, and so on. The specific I/O components 2350 that are included in a particular machine will depend on the type of machine. For example, portable machines such as mobile phones will likely include a touch input device or other such input mechanisms, while a headless server machine will likely not include such a touch input device. It will be appreciated that the I/O components 2350 may include many other components that are not shown in FIG. 24. The I/O components 2350 are grouped according to functionality merely for simplifying the following discussion and the grouping is in no way limiting. In various example embodiments, the I/O components 2350 may include output components 2420 and input components 2422. The output components 2420 may include visual components (e.g., a display such as a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)), acoustic components (e.g., speakers), haptic components (e.g., a vibratory motor, resistance mechanisms), other signal generators, and so forth. The input components 2422 may include alphanumeric input components (e.g., a keyboard, a touch screen configured to receive alphanumeric input, a photo-optical keyboard, or other alphanumeric input components), point-based input components (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or another pointing instrument), tactile input components (e.g., a physical button, a touch screen that provides location and/or force of touches or touch gestures, or other tactile input components), audio input components (e.g., a microphone), and the like.

In further example embodiments, the I/O components 2350 may include biometric components 2424, motion components 2426, environmental components 2428, or position components 2430, among a wide array of other components. For example, the biometric components 2424 may include components to detect expressions (e.g., hand expressions, facial expressions, vocal expressions, body gestures, or eye tracking), measure biosignals (e.g., blood pressure, heart rate, body temperature, perspiration, or brain waves), identify a person (e.g., voice identification, retinal identification, facial identification, fingerprint identification, or electroencephalogram-based identification), and the like. The motion components 2426 may include acceleration sensor components (e.g., accelerometer), gravitation sensor components, rotation sensor components (e.g., gyroscope), and so forth. The environmental components 2428 may include, for example, illumination sensor components (e.g., photometer), temperature sensor components (e.g., one or more thermometers that detect ambient temperature), humidity sensor components, pressure sensor components (e.g., barometer), acoustic sensor components (e.g., one or more microphones that detect background noise), proximity sensor components (e.g., infrared sensors that detect nearby objects), gas sensors (e.g., gas detection sensors to detect concentrations of hazardous gases for safety or to measure pollutants in the atmosphere), or other components that may provide indications, measurements, or signals corresponding to a surrounding physical environment. The position components 2430 may include location sensor components (e.g., a global positioning system (GPS) receiver component), altitude sensor components (e.g., altimeters or barometers that detect air pressure from which altitude may be derived), orientation sensor components (e.g., magnetometers), and the like.

Communication may be implemented using a wide variety of technologies. The I/O components 2350 may include communication components 2434 operable to couple the machine 2400 to a network 2438 or devices 2432 via a coupling 2414 and a coupling 2436, respectively. For example, the communication components 2434 may include a network interface component or another suitable device to interface with the network 2438. In further examples, the communication components 2434 may include wired communication components, wireless communication components, cellular communication components, Near Field Communication (NFC) components, Bluetooth® components (e.g., Bluetooth® Low Energy), Wi-Fi® components, and other communication components to provide communication via other modalities. The devices 2432 may be another machine or any of a wide variety of peripheral devices (e.g., a peripheral device coupled via a USB).

Moreover, the communication components 2434 may detect identifiers or include components operable to detect identifiers. For example, the communication components 2434 may include Radio Frequency Identification (RFID) tag reader components, NFC smart tag detection components, optical reader components (e.g., an optical sensor to detect one-dimensional bar codes such as Universal Product Code (UPC) bar code, multi-dimensional bar codes such as Quick Response (QR) code, Aztec code, Data Matrix, Dataglyph, MaxiCode, PDF417, Ultra Code, UCC RSS-2D bar code, and other optical codes), or acoustic detection components (e.g., microphones to identify tagged audio signals). In addition, a variety of information may be derived via the communication components 2434, such as location via IP geolocation, location via Wi-Fi® signal triangulation, location via detecting an NFC beacon signal that may indicate a particular location, and so forth.

The various memories (i.e., memory 2348, main memory 2412, and/or static memory 2410) and/or storage unit 2416 may store one or more sets of instructions and data structures (e.g., software) embodying or utilized by any one or more of the methodologies or functions described herein. These instructions (e.g., the instructions 2406), when executed by processor(s) 2346, cause various operations to implement the disclosed embodiments. The instructions 2406 may be stored in machine-readable medium 2418.

As used herein, the terms “machine-storage medium,” “device-storage medium,” “computer-storage medium” mean the same thing and may be used interchangeably in this disclosure. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions and/or data. The terms shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to processors. Specific examples of machine-storage media, computer-storage media and/or device-storage media include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms “machine-storage media,” “computer-storage media,” and “device-storage media” specifically exclude carrier waves, modulated data signals, and other such media, at least some of which are covered under the term “signal medium” discussed below.

In various example embodiments, one or more portions of the network 2438 may be an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, the Internet, a portion of the Internet, a portion of the PSTN, a plain old telephone service (POTS) network, a cellular telephone network, a wireless network, a Wi-Fi® network, another type of network, or a combination of two or more such networks. For example, the network 2438 or a portion of the network 2438 may include a wireless or cellular network, and the coupling 2414 may be a Code Division Multiple Access (CDMA) connection, a Global System for Mobile communications (GSM) connection, or another type of cellular or wireless coupling. In this example, the coupling 2414 may implement any of a variety of types of data transfer technology, such as Single Carrier Radio Transmission Technology (1×RTT), Evolution-Data Optimized (EVDO) technology, General Packet Radio Service (GPRS) technology, Enhanced Data rates for GSM Evolution (EDGE) technology, third Generation Partnership Project (3GPP) including 3G, fourth generation wireless (4G) networks, Universal Mobile Telecommunications System (UMTS), High Speed Packet Access (HSPA), Worldwide Interoperability for Microwave Access (WiMAX), Long Term Evolution (LTE) standard, others defined by various standard-setting organizations, other long range protocols, or other data transfer technology.

The instructions 2406 may be transmitted or received over the network 2438 using a transmission medium via a network interface device (e.g., a network interface component included in the communication components 2434) and utilizing any one of a number of well-known transfer protocols (e.g., hypertext transfer protocol (HTTP)). Similarly, the instructions 2406 may be transmitted or received using a transmission medium via the coupling 2436 (e.g., a peer-to-peer coupling) to the devices 2432. The terms “transmission medium” and “signal medium” mean the same thing and may be used interchangeably in this disclosure. The terms “transmission medium” and “signal medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying the instructions 2406 for execution by the machine 2400, and includes digital or analog communications signals or other intangible media to facilitate communication of such software. Hence, the terms “transmission medium” and “signal medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.

The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine-storage media and transmission media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.

Although examples have been described with reference to specific example embodiments or methods, it will be evident that various modifications and changes may be made to these embodiments without departing from the broader scope of the embodiments. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense. The accompanying drawings that form a part hereof show, by way of illustration and not of limitation, specific embodiments in which the subject matter may be practiced. The embodiments illustrated are described in sufficient detail to enable those skilled in the art to practice the teachings disclosed herein. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. This detailed description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.

Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

What is claimed is:
 1. An apparatus for telemetry data processing at a data management system, comprising: at least one processor; memory coupled to the at least one processor; and instructions stored in the memory and executable by the at least one processor to: store telemetry data at a plurality of nodes in the data management system, the telemetry data comprising a plurality of data points that indicate performance information related to the data management system; receive a query at the data management system, the query comprising a request to display at least a portion of the telemetry data via a user interface associated with the data management system; select a subset of the requested telemetry data using one or more of a last-value-delta algorithm, a diff-value algorithm, or a standard deviation band algorithm; obtain the selected subset of the requested telemetry data from the plurality of nodes; and output, in response to the query, an indication of the selected subset of the requested telemetry data for display via the user interface.
 2. The apparatus of claim 1, wherein, to obtain the selected subset of the requested telemetry data, the instructions are executable by the at least one processor to cause the apparatus to: obtain a first portion of the selected subset of the requested telemetry data from a first node of the plurality of nodes in the data management system; and obtain a second portion of the selected subset of the requested telemetry data from a second node of the plurality of nodes in the data management system, wherein the second node is different from the first node.
 3. The apparatus of claim 1, wherein the instructions are executable by the at least one processor to cause the apparatus to: receive the query at a first node of the plurality of nodes in the data management system while some or all of the requested telemetry data is stored at a second node of the plurality of nodes in the data management system.
 4. The apparatus of claim 3, wherein the instructions are further executable by the at least one processor to cause the apparatus to: transmit at least a portion of the requested telemetry data from the second node to the first node in response to the query.
 5. The apparatus of claim 1, wherein, to receive the query at the data management system, the instructions are executable by the at least one processor to cause the apparatus to: receive, via the user interface, a request to display telemetry data associated with a designated time range, wherein the selected subset of the requested telemetry data comprises a subset of the plurality of data points that correspond to the designated time range.
 6. The apparatus of claim 1, wherein, to select the subset of the requested telemetry data, the instructions are executable by the at least one processor to cause the apparatus to: determine that a variance of a data point with respect to a previous data point or a subsequent data point is above a threshold, wherein the requested telemetry data comprises the previous data point, the data point, and the subsequent data point; and select the data point to include in the subset of the requested telemetry data based at least in part on the determination.
 7. The apparatus of claim 1, wherein the performance information comprises system failure information, resource utilization information, system activity information, or a combination thereof.
 8. A method for telemetry data processing at a data management system, comprising: storing telemetry data at a plurality of nodes in the data management system, the telemetry data comprising a plurality of data points that indicate performance information related to the data management system; receiving a query at the data management system, the query comprising a request to display at least a portion of the telemetry data via a user interface associated with the data management system; selecting a subset of the requested telemetry data using one or more of a last-value-delta algorithm, a diff-value algorithm, or a standard deviation band algorithm; obtaining the selected subset of the requested telemetry data from the plurality of nodes; and outputting, in response to the query, an indication of the selected subset of the requested telemetry data for display via the user interface.
 9. The method of claim 8, wherein obtaining the selected subset of the requested telemetry data comprises: obtaining a first portion of the selected subset of the requested telemetry data from a first node of the plurality of nodes in the data management system; and obtaining a second portion of the selected subset of the requested telemetry data from a second node of the plurality of nodes in the data management system, wherein the second node is different from the first node.
 10. The method of claim 8, wherein: the query is received at a first node of the plurality of nodes in the data management system while some or all of the requested telemetry data is stored at a second node of the plurality of nodes in the data management system.
 11. The method of claim 10, further comprising: transmitting at least a portion of the requested telemetry data from the second node to the first node in response to the query.
 12. The method of claim 8, wherein receiving the query at the data management system comprises: receiving, via the user interface, a request to display telemetry data associated with a designated time range, wherein the selected subset of the requested telemetry data comprises a subset of the plurality of data points that correspond to the designated time range.
 13. The method of claim 8, wherein selecting the subset of the requested telemetry data comprises: determining that a variance of a data point with respect to a previous data point or a subsequent data point is above a threshold, wherein the requested telemetry data comprises the previous data point, the data point, and the subsequent data point; and selecting the data point to include in the subset of the requested telemetry data based at least in part on the determination.
 14. The method of claim 8, wherein the performance information comprises system failure information, resource utilization information, system activity information, or a combination thereof.
 15. A non-transitory computer-readable medium storing code for telemetry data processing at a data management system, the code comprising instructions executable by at least one processor to: store telemetry data at a plurality of nodes in the data management system, the telemetry data comprising a plurality of data points that indicate performance information related to the data management system; receive a query at the data management system, the query comprising a request to display at least a portion of the telemetry data via a user interface associated with the data management system; select a subset of the requested telemetry data using one or more of a last-value-delta algorithm, a diff-value algorithm, or a standard deviation band algorithm; obtain the selected subset of the requested telemetry data from the plurality of nodes; and output, in response to the query, an indication of the selected subset of the requested telemetry data for display via the user interface.
 16. The non-transitory computer-readable medium of claim 15, wherein, to obtain the selected subset of the requested telemetry data, the instructions are executable by the at least one processor to: obtain a first portion of the selected subset of the requested telemetry data from a first node of the plurality of nodes in the data management system; and obtain a second portion of the selected subset of the requested telemetry data from a second node of the plurality of nodes in the data management system, wherein the second node is different from the first node.
 17. The non-transitory computer-readable medium of claim 15, wherein the instructions are executable by the at least one processor to: receive the query at a first node of the plurality of nodes in the data management system while some or all of the requested telemetry data is stored at a second node of the plurality of nodes in the data management system, wherein the second node is different from the first node.
 18. The non-transitory computer-readable medium of claim 17, wherein the instructions are further executable by the at least one processor to: transmit at least a portion of the requested telemetry data from the second node to the first node in response to the query.
 19. The non-transitory computer-readable medium of claim 15, wherein, to receive the query at the data management system, the instructions are executable by the at least one processor to: receive, via the user interface, a request to display telemetry data associated with a designated time range, wherein the selected subset of the requested telemetry data comprises a subset of the plurality of data points that correspond to the designated time range.
 20. The non-transitory computer-readable medium of claim 15, wherein, to select the subset of the requested telemetry data, the instructions are executable by the at least one processor to: determine that a variance of a data point with respect to a previous data point or a subsequent data point is above a threshold, wherein the requested telemetry data comprises the previous data point, the data point, and the subsequent data point; and select the data point to include in the subset of the requested telemetry data based at least in part on the determination.
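
By way of a non-limiting illustration only, the data point selection recited in claims 5, 6, 12, 13, 19, and 20 might be sketched as follows. The sketch is not part of the claims and does not define the last-value-delta, diff-value, or standard deviation band algorithms. It assumes that each data point is a (timestamp, value) pair, that the recited "variance" of a data point with respect to a neighbor may be taken as the absolute difference between their values, and that a standard deviation band may be taken as the mean plus or minus k standard deviations. The function names, the parameter `k`, and the data layout are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

# Assumed layout: (timestamp, value). The claims do not fix a representation.
Point = Tuple[float, float]


def in_time_range(points: Sequence[Point], start: float, end: float) -> List[Point]:
    """Keep only the data points whose timestamp lies in [start, end]."""
    return [(ts, v) for ts, v in points if start <= ts <= end]


def select_by_neighbor_delta(points: Sequence[Point], threshold: float) -> List[Point]:
    """Select a point when it differs from its previous or subsequent point
    by more than `threshold` (absolute difference; an assumed reading of
    the claimed "variance ... above a threshold")."""
    selected: List[Point] = []
    for i, (ts, value) in enumerate(points):
        # Endpoints have only one neighbor; the missing side contributes 0.
        prev_delta = abs(value - points[i - 1][1]) if i > 0 else 0.0
        next_delta = abs(points[i + 1][1] - value) if i + 1 < len(points) else 0.0
        if prev_delta > threshold or next_delta > threshold:
            selected.append((ts, value))
    return selected


def select_outside_std_band(points: Sequence[Point], k: float = 2.0) -> List[Point]:
    """Select points lying outside mean +/- k population standard deviations
    (an assumed reading of a "standard deviation band")."""
    values = [v for _, v in points]
    if len(values) < 2:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a flat series has no point outside the band
    return [(ts, v) for ts, v in points if abs(v - mu) > k * sigma]


if __name__ == "__main__":
    series = [(0, 10.0), (1, 10.0), (2, 50.0), (3, 10.0), (4, 10.0)]
    window = in_time_range(series, 0, 4)
    print(select_by_neighbor_delta(window, threshold=5.0))
    print(select_outside_std_band(window, k=1.5))
```

Under these assumptions, the neighbor-delta selection keeps a spike together with the points immediately before and after it, whereas the band selection keeps only the spike itself; either reduces the number of data points that would be obtained from the nodes and output for display.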